AD-A078 690  PURDUE UNIV LAFAYETTE IN SCHOOL OF ELECTRICAL ENGINEERING  F/G 9/2

DESIGN AND ANALYSIS OF INTERCONNECTION NETWORKS FOR PARTITIONAB--ETC(U)
AUG 79  S D SMITH, H J SIEGEL  AFOSR-78-3581

UNCLASSIFIED  TR«EE79-39  AF0SR-TR»79-1972  Mt 




DESIGN AND ANALYSIS OF
INTERCONNECTION NETWORKS
FOR PARTITIONABLE PARALLEL
PROCESSING SYSTEMS


S. Diane Smith
Howard Jay Siegel


School of Electrical Engineering
Purdue University
West Lafayette, Indiana


This work was supported in part by the Air Force Office of Scientific Research, Air
Force Systems Command, USAF, under grant number AFOSR-78-3581 and by the
Purdue Research Foundation under a David Ross Grant (PRF #8718).


Approved for public release;
distribution unlimited.


REPORT DOCUMENTATION PAGE


fc^9-  1272 


2. GOVT ACCESSION NO.

READ INSTRUCTIONS BEFORE COMPLETING FORM

3. RECIPIENT'S CATALOG NUMBER

4. TITLE (and Subtitle)

DESIGN AND ANALYSIS OF INTERCONNECTION NETWORKS
FOR PARTITIONABLE PARALLEL PROCESSING SYSTEMS

5. TYPE OF REPORT & PERIOD COVERED

Interim


TR-EE  79-39 


7. AUTHOR(s)

S. Diane Smith
Howard Jay Siegel

AFOSR-78-3581

8. CONTRACT OR GRANT NUMBER(s)


9. PERFORMING ORGANIZATION NAME AND ADDRESS

School of Electrical Engineering
Purdue University
West Lafayette, IN 47907

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS

61102F

11. CONTROLLING OFFICE NAME AND ADDRESS

Air Force Office of Scientific Research/NM
Bolling AFB, Washington, DC 20332

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

August 1979


15. SECURITY CLASS. (of this report)

Unclassified

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)

Approved for public release; distribution unlimited.

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

18. SUPPLEMENTARY NOTES

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)

computer architecture, cube network, dynamically reconfigurable systems,
image processing, interconnection networks, MIMD machine, multimicroprocessor
systems, parallel processing, partitionable systems, PM2I network,
shuffle-exchange, SIMD machine.

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

Please  see  reverse  side. 


DD FORM 1473

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE

20. ABSTRACT


A single instruction stream - multiple data stream (SIMD) computer
performs one algorithm (single instruction stream) on vectors of data
(multiple data stream). The model of an SIMD machine used here consists
of a control unit (CU), processing elements (PEs), and an interconnection
network. The CU broadcasts instructions to the N PEs (where N is a
power of two). The interconnection network is the mechanism that allows
PEs to pass data among themselves.


Four types of interconnection networks are discussed in this work:
the Shuffle-Exchange network, the Cube network, the ILLIAC network, and
the Plus-Minus 2^i (PM2I) network. Each type has been discussed in the
literature and used in an existing or proposed machine design.

For each of these four network types, different hardware structures
are considered. A recirculating network consists of one stage of
switches that is reused until the data reach their final destinations.
A combinational logic multistage network consists of several stages of
switches and, usually, data are transferred in one pass through the
network. In pipelined multistage networks, which are introduced, registers
are inserted after each stage of a combinational logic multistage
network. The data are divided into segments, and these segments are passed
in a parallel-pipelined manner. Hardware implementations for recirculating,
combinational logic multistage, and pipelined multistage
networks are presented and analyzed.

Some  problems  may  be  more  efficiently  solved  if  a  large  SIMD 
machine  can  be  partitioned  into  smaller  groups  of  varying  sizes  of 
powers  of  two.  The  interconnection  network  must  be  able  to  support  this 
partitioned  machine.  The  partitioning  properties  of  the  four  types  of 
networks  are  presented. 

In the selection of an interconnection network for a computer
design, the types of algorithms that must be executed should be
considered. A detailed analysis of various networks is presented for three
parallel image processing algorithms: smoothing, histogram formation,
and data classification.

The  Augmented  Data  Manipulator  (ADM)  network  is  introduced.  An 
analysis  is  presented  which  compares  the  capabilities  of  the  ADM  with 
those  of  a  multistage  Cube  network  and  with  the  Inverse  Augmented  Data 
Manipulator  network. 

The Emulator System, a proposed hardware design aid, is introduced.
The flexibility and power of this system are demonstrated by its ability
to simulate many types of interconnection networks and control schemes
that have appeared in the literature.

As the costs of microprocessors continue to decrease, more large
scale multiprocessor systems are being proposed and built. This thesis
will aid system architects in designing a partitionable interconnection
network appropriate for their particular needs.


UNCLASSIFIED


DESIGN  AND  ANALYSIS  OF 


INTERCONNECTION  NETWORKS 


FOR PARTITIONABLE PARALLEL PROCESSING SYSTEMS


S.  Diane  Smith 


Howard  Jay  Siegel 


Purdue  University 


School  of  Electrical  Engineering 


West  Lafayette,  IN  47907 


Purdue  University 


Approved for public release.

Distribution is unlimited.

A.  D.  BLOSH 

Technical  Information  Officer 

This  work  was  supported  in  part  by  the  Air  Force  Office  of  Scientific 
Research,  Air  Force  Systems  Command,  USAF,  under  grant  number 
AFOSR-78-3581  and  by  the  Purdue  Research  Foundation  under  a  David  Ross 
Grant (PRF #8718).


ACKNOWLEDGEMENTS 


The author wishes to thank the following people for their comments
on the content of this work: Prof. K. S. Fu, Prof. A. P. Reeves, Prof.
C. Smith, Prof. L. J. Siegel, R. J. McMillen, and P. T. Mueller, Jr.
This  work  was  supported  in  part  by  the  Air  Force  Office  of  Scientific 
Research,  Air  Force  Systems  Command,  USAF,  under  grant  number 
AFOSR-78-3581  and  by  the  Purdue  Research  Foundation  under  a  David  Ross 
Grant  (PRF#8718). 

Portions  of  this  work  have  appeared  in  the  following  papers: 


H. J. Siegel and S. Diane Smith, "Study of Multistage SIMD Interconnection
Networks," Proceedings of the Fifth Annual Symposium on Computer
Architecture, pp. 223-229, April 1978.

S. Diane Smith and H. J. Siegel, "Recirculating, Pipelined, and Multistage
SIMD Interconnection Networks," Proceedings of the 1978
International Conference on Parallel Processing, pp. 206-214, August
1978.

S.  Diane  Smith  and  H.  J.  Siegel,  "An  Emulator  Network  for  SIMD  Machine 
Interconnection  Networks,"  Proceedings  of  the  1979  International 
Symposium  on  Computer  Architecture,  pp.  232-241,  April  1979. 

H.  J.  Siegel,  Leah  J.  Siegel,  Robert  J.  McMillen,  Philip  T.  Mueller, 
Jr.,  and  S.  Diane  Smith,  "An  SIMD/MIMD  Multimicroprocessor  System  for 
Image  Processing  and  Pattern  Recognition,"  Proceedings  of  the  1979  IEEE 
Computer  Society  Conference  on  Pattern  Recognition  and  Image  Processing, 
pp.  214-224,  August  1979. 




TABLE OF CONTENTS


LIST OF TABLES ..... vi

LIST OF FIGURES ..... viii

LIST OF ALGORITHMS ..... xvii

ABSTRACT ..... xviii


I. INTRODUCTION ..... 1


II. DEFINITIONS ..... 5

II.1. SIMD Computers ..... 5

II.2. Interconnection Networks ..... 11

II.3. Network Structures ..... 21

II.4. PE Address Masks ..... 26


III. LITERATURE REVIEW ..... 28


IV. NETWORK STRUCTURES ..... 39

IV.1. Introduction ..... 39

IV.2. Hardware Implementations ..... 40

IV.3. The Shuffle-No Shuffle-Exchange Network ..... 53

IV.4. Combinational Logic Multistage Networks
      and Pipelined Multistage Networks

IV.5. Equal Cost Combinational Logic and
      Pipelined Multistage Networks ..... 67

IV.6. Average Data Transfer Times ..... 73

IV.7. k-stage Pipelined Multistage Networks ..... 75

IV.8. Custom Integrated Circuit Implementations ..... 83

IV.9. Conclusions ..... 96


V. PARTITIONING INTERCONNECTION NETWORKS ..... 97

V.1. Introduction ..... 97

V.2. Partitioning Multistage Networks ..... 102

V.3. Partitioning Recirculating Networks ..... 133

V.4. Multiple Control Partitioning and
     Variable Size Partitions ..... 139

V.5. Conclusions ..... 141


VI. THE AUGMENTED DATA MANIPULATOR NETWORK ..... 144

VI.1. Introduction ..... 144

VI.2. The ADM and the Generalized Cube Network ..... 145

VI.3. Capabilities and Limitations of the ADM ..... 155

VI.4. Some Theoretical Results ..... 166

VI.5. Conclusions ..... 168


VII. PARALLEL ALGORITHMS AND INTERCONNECTION NETWORKS ..... 169

VII.1. Introduction ..... 169

VII.2. Image Processing: Data Distribution ..... 171

VII.3. Image Processing: A Smoothing Algorithm ..... 172

VII.4. Image Processing: Thresholding ..... 197

VII.5. Image Data Classification ..... 206

VII.6. Conclusions ..... 213


VIII. AN EMULATOR NETWORK FOR
      SIMD MACHINE INTERCONNECTION NETWORKS ..... 215

VIII.1. Introduction ..... 215

VIII.2. The Emulator Network ..... 217

VIII.3. Simulation Results: Multistage Networks ..... 222

VIII.4. Simulation Results: Recirculating Networks ..... 242

VIII.5. Other Useful Permutations ..... 244

VIII.6. Full Access Networks ..... 251

VIII.7. Partitioning ..... 254

VIII.8. The Emulator Network and MIMD Machines ..... 256

VIII.9. Conclusions ..... 263


IX. CONCLUSION ..... 265

LIST OF REFERENCES ..... 269


LIST OF TABLES


TABLE  PAGE

IV.1  Definitions and abbreviations ..... 41

IV.2  S_m, S_pm, T_m, and T_p for equal cost networks ..... 69

IV.3  Minimum value of k, the number of stages in a pipelined
      multistage network, given S, the number of
      segments of data to be passed, and
      the corresponding value of T_k ..... 76

IV.4  For two equal cost multistage networks,
      one combinational logic and one a k-stage pipeline,
      S_k = S_m * (n + k - 1) / [illegible] ..... 78

IV.5  Figures of merit for combinational logic multistage
      networks for N = 1024 and W = 16

IV.6  Figures of merit for pipelined multistage
      networks for N = 1024 and W = 16

V.1   The data manipulator network can exchange two data items
      if the following controls are applied
      at stage i, 0 <= i < n ..... 121

V.2   Partitioning based on consecutive PEs
      using the augmented data manipulator network ..... 125

V.3   Summary of the partitioning properties of the networks
      discussed in Chapter V. N = 2^n PEs are partitioned
      into 2^(n-r) groups of 2^r PEs, 0 < r < n ..... 143

VI.1  For N = 4, the ADM forms all 24 possible
      permutations of 4 PE addresses ..... 149

VII.1 The recirculating Cube network requires the given
      number of passes to execute the interconnection functions
      needed for a parallel smoothing algorithm ..... 178

VII.2 A combinational logic cube network requires the
      number of passes listed to accomplish the interconnection
      functions for a parallel smoothing algorithm ..... 184

VII.3 A 5-stage pipelined multistage cube network passes data for
      a parallel smoothing algorithm in the times shown above ..... 185

VII.4 A recirculating PM2I network passes data
      for a parallel smoothing algorithm as shown above ..... 186

VII.5 A 5-stage pipelined multistage PM2I network passes
      data for a parallel smoothing algorithm
      according to the table above ..... 188

VII.6 Summary of data transfer times for
      different networks for a parallel smoothing algorithm ..... 190

VII.7 Data transfer times for various networks
      for a parallel histogram formation algorithm ..... 204

VII.8 Delay times for data transfer for the image
      classification algorithm, where M is the number of classes
      and D = dms = dm = dr = drn ..... 211


LIST OF FIGURES


FIGURE  PAGE

II.1  A PE-to-PE model of an SIMD machine for N = 2^n ..... 6

II.2  A processor-to-memory model of an
      SIMD machine for N = 2^n ..... 7

II.3  A bus structure for connecting N PEs ..... 12

II.4  A crossbar matrix switch for connecting N PEs ..... 13

II.5  For N = 8, the Cube interconnection functions
      can be viewed geometrically as connecting the
      eight vertices of a 3 dimensional cube ..... 15

II.6  The Shuffle and Exchange functions for N = 8 ..... 17

II.7  The PM2_+j interconnection functions connect PE(i) to
      PE(i + 2^j modulo N), and the PM2_-j interconnection
      functions connect PE(i) to PE(i - 2^j modulo N),
      0 <= i < N. Here, N = 8 ..... 19

II.8  For N = 16, the Illiac network connects
      PE(i) to PE(i+1 modulo 16), to PE(i-1 modulo 16),
      to PE(i+4 modulo 16), and to PE(i-4 modulo 16) ..... 20

II.9  A model for a recirculating network for PE(i) ..... 22

II.10 A model for a multistage interconnection
      network for PE(i) ..... 23

II.11 The interchange box is a two-input two-output
      device that can be in one of four legitimate states:
      straight, exchange, lower broadcast, or upper broadcast ..... 24


III.1 The programmable switching network [FS77] is a Benes
      rearrangeable switching network, shown here for N = 8 ..... 29

III.2 A bitonic sorter for an arbitrary sequence
      of elements [SIE78b] ..... 30

III.3 The omega network is a log2 N stage Shuffle-Exchange
      network with four function interchange boxes [LAW75] ..... 32

III.4 The STARAN flip network for N = 8 [BAT76].
      (a) The flip control is individual stage control.
      (b) The shift control is partial stage control ..... 34

III.5 The indirect binary n-cube is a log2 N stage cube
      network, shown here for N = 8. The first stage forms
      Cube_0, the second Cube_1, and the last Cube_2 [PEA77] ..... 35

III.6 The generalized cube network, shown here for N = 8, is a
      log2 N stage cube network. The first stage, stage 2,
      forms Cube_2, the next forms Cube_1, and
      the last forms Cube_0 ..... 36


III.7 The data manipulator network [FE74], shown for
      N = 8, is a log2 N stage PM2I network.
      The dashed lines represent the U control line interconnections.
      The dotted lines represent the H control line interconnections.
      The solid lines represent the D control line interconnections.
      For stage i, U1, D1, and H1 control those cells
      whose i-th bit is 0, and U2, D2, and H2
      control those cells whose i-th bit is 1 ..... 37

IV.1  An interchange box.
      If Exchange = 0, then D0_out = D0_in and D1_out = D1_in.
      If Exchange = 1, then D0_out = D1_in and D1_out = D0_in ..... 42

IV.2  An inverting interchange box using AND-OR-INVERT logic ..... 44

IV.3  A PM2I module for row k [FE74] ..... 46

IV.4  A circuit for a recirculating Shuffle-Exchange
      network for PE(i), 0 <= i < N ..... 47

IV.5  A circuit for a recirculating Cube network
      for PE(i), 0 <= i < 8 ..... 48

IV.6  A recirculating PM2I network
      for PE(i), 0 <= i < 8, N = 8 ..... 49

IV.7  Break even points for a multistage Cube
      network versus a recirculating Cube network ..... 51

IV.8  A Shuffle-No Shuffle-Exchange circuit for
      row i, 0 <= i < N ..... 55


IV.9  A model for a pipelined multistage network of width B
      for N PEs and data word D_(W-1) ... D_1 D_0 ..... 62

IV.10 Tp vs Tm for N = 16 ..... 64

IV.11 Tp vs Tm for N = 64 ..... 65

IV.12 Tp vs Tm for N = 1024 ..... 66

IV.13 Cost vs delay for equal cost pipelined
      and combinational logic networks for N = 16 ..... 70

IV.14 Cost vs delay for equal cost pipelined
      and combinational logic networks for N = 128 ..... 71

IV.15 Cost vs delay for equal cost pipelined and
      combinational logic networks for N = 1024 ..... 72

IV.16 A generalized cube network for N = 4 ..... 85

IV.17 Eight interchange boxes from stages i and
      i-1 of a generalized cube network ..... 87

IV.18 A circuit for line P at stage i selects data from
      line P, line (P + 2^i) modulo N, or line
      (P - 2^i) modulo N of stage i+1. The multiplexer
      selects some input to become the output P_i
      based on the set of control signals input to it ..... 90

IV.19 At stage i of a multistage PM2I network,
      two multiplexers, P and P', can be
      constructed on one integrated circuit chip ..... 91


IV.20 The circuits for multiplexer P at
      stage i and multiplexer P at stage i-1 ..... 92

IV.21 One multiplexer behaves the same as two stages, stages i
      and i-1, of the network for position P. The equivalent
      circuit receives input from stage i+1 and outputs data
      to stage i-2. It selects data from one of seven places ..... 95

V.1   For N = 2^n, single control partitioning
      supplies the same instruction stream to
      2^(n-p) SIMD machines of size 2^p ..... 99

V.2   For N = 2^n, multiple control partitioning implies
      multiple control units. Each of 2^(n-p) control units
      commands 2^p PEs. Two control units can be linked
      to form an SIMD machine of size 2^(p+1) ..... 100

V.3   For N = 16, a multistage Shuffle-Exchange network can be
      partitioned into four groups of four PEs such that
      for all PEs in a group the two most significant
      PE address bits are the same ..... 104

V.4   A multistage Shuffle-Exchange network can be partitioned
      into two groups of four PEs each by setting
      one stage to no exchange. Group G1 is the four PEs 0, 1, 2,
      and 3. Group G2 is the four PEs 4, 5, 6, and 7 ..... 106


V.5   If PEs 0, 1, 2, and 3 use the network, then a
      generalized cube network uses the controls above
      to affect data passage through the network ..... 110

V.6   An indirect binary n-cube network controls data
      from PEs 0, 1, 2, and 3 using the controls above ..... 111

V.7   For N = 8, a multistage PM2I network can be
      partitioned into two groups of four PEs.
      Group A is PEs 0, 2, 4, and 6.
      Group B is PEs 1, 3, 5, and 7.

V.8   PM2_+1 is a missing end-around connection for PE(2)
      of partition 0. It moves data from PE(2) to PE(4) in
      partition 1 instead of moving the data
      to PE(0) of partition 0 ..... 127

V.9   Applying PM2_-2 to PE(2) before applying PM2_+1 corrects
      for the missing end-around PM2_+1 connection for PE(2) ..... 128

V.10  A missing end-around connection for PE(2)
      is corrected by a missing end-around PM2_-2
      connection applied to the data from PE(2) ..... 130

V.11  Applying PM2_-2 to PE(2) before applying PM2_+1
      and PM2_+0 causes data from PE(2) to be sent
      to PE(2 + 2 + 1 modulo 4) = PE(1) of partition 0.

V.12  Four PEs out of 16 grouped by an Illiac connection ..... 135


V.13  An Illiac network for N = 4, m = 2 ..... 135

V.14  A generalized cube can be partitioned into groups of
      different sizes when all sizes are powers of two. Here,
      G0 and G1 are of size 2, and G2 is of size 4 ..... 140

VI.1  For N = 8, the ADM can perform the
      perfect shuffle permutation ..... 153

VI.2  Stage 1 of the ADM for N = 8 can conditionally exchange
      data between PEs P and (P+2) modulo N for P = X00 ..... 156

VI.3  Stage 1 of the ADM can conditionally exchange
      data between PEs P and (P+2) modulo N for P = X10 ..... 157

VI.4  The Inverse ADM (IADM) is a multistage PM2I network where
      the first stage conditionally performs PM2_+-0, the second,
      PM2_+-1, and so forth, and the last stage, PM2_+-(n-1) ..... 162

VII.1 Data transfer times vs number of pixels
      per PE for Algorithm VII.1 ..... 191

VII.2 A recursive doubling algorithm sums 8 data items
      in 3 steps, where each step is a data transfer
      among PEs followed by an addition ..... 199


VII.3 A modified recursive doubling algorithm for
      histogram formation.
      Let Ai be the HIST(0 -> 127) data from PE(i).
      Let Bi be the HIST(128 -> 255) data from PE(i).
      Let Ai,j be the sum Ai + A(i+1) + ... + Aj.
      Let Bi,j be the sum Bi + B(i+1) + ... + Bj ..... 201

VIII.1 A node of the emulator network for PE(i) for N = 2^n PEs
       can be built using N (2 log2 N + 1) tristate buffers.
       Control signals are provided by a register in PE(i) ..... 218

VIII.2 Architecture of the Emulator System for PE(i) ..... 219

VIII.3 An SW banyan network and its graph for N = 8 [GL73] ..... 224

VIII.4 A Shuffle-Exchange network with simplified control
       for N = 8. Dotted lines represent paths
       of control signals. The control bits for stage i are
       calculated from those of the previous stage [LAST76] ..... 231

VIII.5 A Shuffle-Exchange network with simplified control
       can realize a cyclic shift of three when the specified
       boolean function for calculating the control bits is the
       exclusive or function [LAST76] ..... 232


VIII.6 The graph of a CC banyan network for
       S = F = 2 and N = 8 [GL73] connects V(i,k) at stage k
       to V(j,k+1) at stage k+1 if and only if
       j = i or j = (i + 2^k) modulo N, 0 <= k < n ..... 239

VIII.7 The perfect shuffle and the inverse perfect
       shuffle for N = 8 ..... 245

VIII.8 A PSN for N = 8, and the control bits
       for each interchange box [FS77] ..... 252

VIII.9 For N = 8, to simulate an indirect binary n-cube that
       is partitioned into two groups based on the
       most significant PE address bits, the emulator system
       simulates the two stage network above ..... 255

VIII.10 The CHoPP is a proposed MIMD computer system.
        The interconnection network routes data
        between processing elements [SUB77] ..... 256

VIII.11 The half-ring tree structure has been suggested
        for connecting processors in the X-tree system [SDP78] ..... 259

VIII.12 The full-ring tree structure has been proposed
        for connecting processors in the X-tree system [SDP78] ..... 260

VIII.13 N = 31 PEs can be connected in a full-ring structure where
        the addresses at each level are as given ..... 261


ABSTRACT 

A single instruction stream - multiple data stream (SIMD) computer
performs one algorithm (single instruction stream) on vectors of data
(multiple data stream). The model of an SIMD machine used here consists
of a control unit (CU), processing elements (PEs), and an interconnection
network. The CU broadcasts instructions to the N PEs (where N is a
power of two). The interconnection network is the mechanism that allows
PEs to pass data among themselves.

Four types of interconnection networks are discussed in this work:
the Shuffle-Exchange network, the Cube network, the ILLIAC network, and
the Plus-Minus 2^i (PM2I) network. Each type has been discussed in the
literature and used in an existing or proposed machine design.

For each of these four network types, different hardware structures
are considered. A recirculating network consists of one stage of
switches that is reused until the data reach their final destinations.
A combinational logic multistage network consists of several stages of
switches and, usually, data are transferred in one pass through the
network. In pipelined multistage networks, which are introduced, registers
are inserted after each stage of a combinational logic multistage
network. The data are divided into segments, and these segments are passed
in a parallel-pipelined manner. Hardware implementations for recirculating,
combinational logic multistage, and pipelined multistage
networks are presented and analyzed.



Some  problems  may  be  more  efficiently  solved  if  a  large  SIMD 
machine  can  be  partitioned  into  smaller  groups  of  varying  sizes  of 
powers  of  two.  The  interconnection  network  must  be  able  to  support  this 
partitioned  machine.  The  partitioning  properties  of  the  four  types  of 
networks  are  presented. 

In the selection of an interconnection network for a computer
design, the types of algorithms that must be executed should be
considered. A detailed analysis of various networks is presented for three
parallel image processing algorithms: smoothing, histogram formation,
and data classification.

The Augmented Data Manipulator (ADM) network is introduced. An
analysis  is  presented  which  compares  the  capabilities  of  the  ADM  with 
those  of  a  multistage  Cube  network  and  with  the  Inverse  Augmented  Data 
Manipulator  network. 

The Emulator System, a proposed hardware design aid, is introduced.
The flexibility and power of this system are demonstrated by its ability
to simulate many types of interconnection networks and control schemes
that have appeared in the literature.

As  the  costs  of  microprocessors  continue  to  decrease,  more  large 
scale  multiprocessor  systems  are  being  proposed  and  built.  This  thesis 
will  aid  system  architects  in  designing  a  partitionable  interconnection 
network  appropriate  for  their  particular  needs. 




I.  INTRODUCTION 


Two basic factors influence the speed of operation of a computer
system. First is the speed of the logic circuits. Future technology
promises to bring this into the picosecond delay range, but theoretical
limitations, such as the speed of light, mean that other methods should
be used to increase computational speed. The second factor, then, is
the organization of the machine and the algorithms which it performs.

Inexpensive microprocessors have made large scale parallel processing
systems feasible. Such architectures can be used for problems that
can be broken into independent subtasks, which can be done simultaneously,
thus increasing computational speed. Examples of problems that
benefit from parallel processing systems are weather forecasting, image
processing, and air traffic control.

One type of parallel architecture is an SIMD (single instruction
stream - multiple data stream) system. The main components of this type
of system are

(1) N processing elements which operate in parallel, all executing the
same instruction at the same time,

(2) one control unit (CU), which sends instructions and other control
information to the processing elements, and

(3) an interconnection network, which allows the processing elements to
communicate among themselves.
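
The role of the interconnection network in item (3) can be illustrated with a small sketch. It assumes the usual address-based view, in which each network is a set of functions mapping a source PE address to a destination PE address, as in the captions of Figures II.5 through II.8. The Python function names are illustrative only and do not come from this report; the Illiac offsets of plus or minus the square root of N generalize the N = 16 case of Figure II.8 and are an assumption for other sizes.

```python
import math

def cube(i, addr):
    """Cube_i: complement bit i of the PE address."""
    return addr ^ (1 << i)

def pm2(i, addr, n_pes, sign=+1):
    """PM2_+i (sign=+1) or PM2_-i (sign=-1): add or subtract 2**i modulo N."""
    return (addr + sign * (1 << i)) % n_pes

def shuffle(addr, n_pes):
    """Perfect shuffle: rotate the log2(N)-bit address left by one bit."""
    n_bits = n_pes.bit_length() - 1          # N is assumed to be a power of two
    msb = addr >> (n_bits - 1)               # bit that wraps around to position 0
    return ((addr << 1) & (n_pes - 1)) | msb

def exchange(addr):
    """Exchange: complement the least significant address bit."""
    return addr ^ 1

def illiac(addr, n_pes):
    """Destinations reached by the four Illiac functions: +1, -1, +sqrt(N), -sqrt(N)."""
    r = math.isqrt(n_pes)                    # assumes N is a perfect square
    return [(addr + d) % n_pes for d in (1, -1, r, -r)]
```

For example, `illiac(0, 16)` lists the four neighbours of PE(0) named in the caption of Figure II.8, and `pm2(1, 2, 8)` is the PM2_+1 move from PE(2) to PE(4).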

Many questions remain unanswered about the design and use of
interconnection networks. This research formulates design criteria and
analysis techniques for interconnection networks with emphasis placed on
a class of multiprocessor systems which we call partitionable parallel
processing systems. A partitionable parallel processing system is a
reconfigurable parallel computer which can be configured not only as one
SIMD machine with N processing elements, but also as many smaller SIMD
machines. If T tasks each require at most N/T processing elements, then
this multiple SIMD mode more fully utilizes the system.
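
As a minimal sketch of the multiple SIMD mode, the following assumes that N and T are powers of two and that the groups are formed from consecutive PE addresses, so that all PEs in a group share the same high-order address bits (one of the groupings named in the figure captions for Chapter V). The function name is hypothetical, and other groupings are possible.

```python
def partition_by_high_bits(n_pes, n_groups):
    """Split PE addresses 0..N-1 into T equal groups of consecutive addresses."""
    both_powers_of_two = (n_pes & (n_pes - 1)) == 0 and (n_groups & (n_groups - 1)) == 0
    if not both_powers_of_two or not 0 < n_groups <= n_pes:
        raise ValueError("N and T must be powers of two with T <= N")
    size = n_pes // n_groups                 # each task receives N/T PEs
    return [list(range(g * size, (g + 1) * size)) for g in range(n_groups)]

groups = partition_by_high_bits(16, 4)
# Each inner list is one independent SIMD machine of four PEs whose
# two most significant address bits are the same.
```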

Chapter II introduces the terminology of parallel systems and
interconnection networks which will be used throughout this research. A
survey of the background literature in the field of interconnection
networks is presented in Chapter III. Some of the networks which are
discussed and analyzed in this work are introduced in Chapter III.

Chapter IV considers various aspects of the structure of networks
and the circuits used to build them. Single stage networks and multistage
networks of combinational logic are designed. By inserting registers
after stages of a multistage network, blocks of data can be pipelined
through the network to improve the effective throughput of the
data transfer. The effects of pipelining on the cost and data transfer
time of the network are analyzed. Comparisons are made between pipelined
and unpipelined (combinational logic) multistage networks.
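
The throughput effect of pipelining can be illustrated with a generic textbook timing model. This is not the cost and delay model developed in Chapter IV. It assumes that every stage has the same delay, that an unpipelined network must finish one segment before accepting the next, and that a pipelined network accepts a new segment every cycle once the registers are in place. All names and parameters are hypothetical.

```python
def unpipelined_time(segments, stages, stage_delay):
    """Each segment crosses all stages before the next segment enters."""
    return segments * stages * stage_delay

def pipelined_time(segments, stages, stage_delay, register_delay=0.0):
    """The first segment needs `stages` cycles; each later segment adds one cycle."""
    cycle = stage_delay + register_delay     # registers lengthen every cycle
    return (stages + segments - 1) * cycle

# Four segments through a 10-stage network with unit stage delay:
slow = unpipelined_time(4, 10, 1)            # 4 * 10 = 40 time units
fast = pipelined_time(4, 10, 1)              # (10 + 4 - 1) = 13 time units
```

Under these assumptions the pipelined network wins as soon as more than one segment is sent, provided the register delay is small compared with the stage delay. With a single segment the added register delay makes it slower.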

The partitioning of an interconnection network into independent
subnetworks is discussed in Chapter V. This allows a single set of
processors to act as many independent SIMD machines. The partitioning
properties of single stage, multistage, and pipelined networks are
analyzed. The capabilities and restrictions imposed by partitioning are
investigated.

In Chapter VI, an enhanced network, the Augmented Data Manipulator,
is analyzed. The Augmented Data Manipulator is a highly flexible multistage
network. Its capabilities are compared with other networks, and
some group theoretic properties of the way in which it passes data are
presented.

Image  processing  tasks  can  efficiently  utilize  parallel  computer 
systems.  Chapter  VII  presents  three  parallel  image  processing  algo¬ 
rithms,  a  smoothing  algorithm,  a  histogram  formation  algorithm,  and  a 
data classification algorithm. For each, the results of Chapters IV and
V  are  used  to  analyze  the  effect  of  the  interconnection  network  of  the 
parallel  system  upon  the  performance  of  the  algorithm. 

Little  is  known  about  the  interaction  of  interconnection  networks 
and  parallel  algorithms.  An  effective  system  design  aid  would  be  one 
which  simulates  the  effects  of  a  proposed  interconnection  network.  Such 
a  tool,  the  emulator  system,  is  introduced  in  Chapter  VIII.  Consisting 
of  a  set  of  processing  elements  which  interface  to  a  powerful  set  of 
interconnections  among  the  processing  elements,  the  emulator  system  can 
simulate  a  wide  variety  of  existing  and  proposed  interconnection  net¬ 
works.  The  processing  elements  offer  computation  capability  to  test 
schemes  to  control  an  interconnection  network. 

In  this  research,  the  interconnection  networks  presented  in 
Chapters  II  and  III  are  studied.  The  various  analysis  techniques  which 


are  used  for  this  work  can  be  generalized  and  applied  to  other  intercon¬ 
nection  networks.  Thus,  the  significance  of  this  work  lies  not  only  in 
the  specific  results,  but  also  in  the  methods  used  to  obtain  them. 


II.  DEFINITIONS 


II.1. SIMD Computers

Typically, an SIMD machine (single instruction stream - multiple
data stream) [FL66] consists of a control unit (CU), N processors, N
memories, and an interconnection network. The CU broadcasts
instructions to the N processors, and all active processors execute the
same instruction at the same time, but on different data streams.
Processors pass data among themselves through the interconnection
network. The model for an SIMD computer used here allows each processor
a private memory. This combination is referred to as a processing
element or PE. The interconnection network links PEs, as shown in
Figure II.1, and this model is referred to as the PE-to-PE model. The
Illiac IV [BAR68] is configured in this fashion. Another model (Figure
II.2), the processor-to-memory model, uses the interconnection network
to move data from the processors to the memory and vice versa. The
processors and memories of the STARAN computer [BA76] are connected in
this manner. The PE-to-PE model can simulate the processor-to-memory
model and vice versa. If processor P(i) addresses memory M(j) in the
processor-to-memory model, then in the PE-to-PE model, if PE(i) passes
an address to PE(j) and PE(j) passes the data from its memory back to
PE(i), the effect is the same.

Each PE is assigned a unique address from 0 to N-1, where N is a
power of two, that is, $N = 2^n$ and $n = \log_2 N$. The address in binary of
an arbitrary PE P is denoted by $p_{n-1} p_{n-2} \ldots p_1 p_0$. When a specific group
of PEs is referenced, the group can be identified by specifying each bit
of the n bit PE address as 0, 1, or X, where X is a "don't care" state
(this is based on the "PE address mask" notation [SIE75, SIE77a, SM78]).
Superscripts are used as repetition factors. For example, the set of
all odd numbered PEs is $X^{n-1}1$, and the set of all even numbered PEs is
$X^{n-1}0$. It is assumed that each PE knows its own address. Also, let $\bar{p}_i$
represent the complement of $p_i$.

Each  PE  has  special  data  transfer  registers  (DTRs)  for  passing  data 
to and receiving data from the network. PEs load data into DTRin
registers,  and  the  data  are  moved  by  the  interconnection  network  to  the 
DTRout  registers,  from  which  the  PEs  can  access  the  data. 

SIMD  machines  perform  certain  types  of  tasks,  such  as  matrix 
computations,  faster  than  conventional  single  processor  serial  operation 
computers.  Consider  the  elementwise  addition  of  two  vectors,  A  and  B, 
both  with  N  elements.  Let  the  resultant  sum,  C,  be  stored  as  an  N  word 
vector. Assume the SIMD machine has a PE-to-PE configuration, and that
A(i), B(i), and C(i) are stored in PE(i), 0 ≤ i < N. To compute C, a
serial  computer  executes  the  code 

for i = 0 until N-1 step +1 do
C(i)  =  A(i)  +  B(i), 

and uses N steps to complete the operation. The SIMD computer is a
parallel processor, and earns this name by processing all N elements of
the vector addition in parallel. So, PE(i) performs C(i) = A(i) + B(i),
simultaneously for all i, 0 ≤ i < N. The SIMD computer completes the
operation in one step consisting of reading A(i) and B(i) from memory,
adding the two, and writing the result into C(i).

In the example above, the SIMD machine completed the task faster
than the serial processor because the data were distributed among N PEs,
and  no  communication  was  needed  among  the  PEs.  For  some  tasks,  PEs  must 
pass  data  among  themselves,  and  it  is  the  interconnection  network  which 
supports  this  data  movement. 

The  next  example  illustrates  the  function  of  the  interconnection 
network  of  the  SIMD  machine.  Suppose  that  the  vectors  A,  B,  and  C  are 
stored  as  in  the  previous  example,  and  that  each  is  N  words  long. 
Consider  the  code 

for  i  =  1  to  N-1  step  +1  do 
C(i)  =  A(i-1)  +  B(i) 

C(0) = B(0).

The SIMD computer performs this task in five steps.

(1) Before PE(i) can perform the addition, it must receive A(i-1) from
PE(i-1). So, first, PEs 0 through N-2 move data into DTRin
simultaneously.

(2) Next, the interconnection network is set to move data from PE(i-1)
to PE(i), for all i, 1 ≤ i < N, simultaneously.

(3) Then, PE(i) retrieves the data from DTRout, for all i, 1 ≤ i < N,
simultaneously.

(4) The addition is done in PEs 1 through N-1 simultaneously.

(5) Lastly, PE(0) stores B(0) in C(0).




For  comparison,  the  serial  processor  performs  this  task  in  N  steps,  each 
of  which  is  a  read-add-write  step  or  a  read-write  step. 
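The five steps of the second example can be traced with a short serial simulation. This is only an illustrative sketch: the function name and the list-based model of DTRin and DTRout are assumptions made here, and each Python loop stands in for a single step that all participating PEs would execute simultaneously.

```python
def simd_shift_add(A, B):
    """Simulate C(i) = A(i-1) + B(i) for 1 <= i < N, and C(0) = B(0).

    A, B: lists of length N; element i is assumed to reside in PE(i).
    """
    N = len(A)
    dtr_in = [None] * N    # one DTRin register per PE (hypothetical model)
    dtr_out = [None] * N   # one DTRout register per PE (hypothetical model)
    C = [None] * N

    # Step 1: PEs 0 through N-2 load A(i) into DTRin.
    for i in range(N - 1):
        dtr_in[i] = A[i]

    # Step 2: the network moves data from PE(i-1) to PE(i), 1 <= i < N.
    for i in range(1, N):
        dtr_out[i] = dtr_in[i - 1]

    # Steps 3 and 4: PEs 1 through N-1 read DTRout and perform the addition.
    for i in range(1, N):
        C[i] = dtr_out[i] + B[i]

    # Step 5: PE(0) stores B(0) in C(0).
    C[0] = B[0]
    return C
```

The number of parallel steps is fixed at five regardless of N, whereas the serial loop grows with N.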

These  two  examples  show  that,  while  the  SIMD  machine  has  about  N 
times  the  hardware  of  a  serial  processor,  it  does  not  always  perform  a 
task  N  times  faster.  In  the  second  example,  the  overhead  introduced  by 
data transfer among PEs limited the speedup of the task. A more
extensive tutorial on parallel processing is found in [KUC77].


II.2. Interconnection Networks


The interconnection network may take many forms. A bus structure
(Figure II.3) requires the least hardware of any method. But, only one
PE at a time may use the bus, and so transfers that require all PEs to
move data are time consuming. At the opposite extreme, a crossbar
switch matrix (Figure II.4) can connect any PE to any other PE and can
allow all PEs to transfer data simultaneously. Since a switch is
required at each switching node of the crossbar, $O(N^2)$ gates are
required. Thus, the network is too expensive for use with a large
number of processors. Benes [BE65] proposed the rearrangeable switching
network, which has the same capability as the crossbar but uses only
$O(N \log_2 N)$ gates. But the fastest algorithm to set up the network
requires time $O(N \log_2 N)$ [OPT71].

A practical interconnection network must compromise between the
speed of the crossbar and the low cost of the bus. This work will consider networks
that  are  less  complex  than  the  crossbar  but  faster  than  the  bus. 

An  interconnection  network  can  be  described  as  a  set  of 
interconnection  functions,  where  each  interconnection  function  is  a 
permutation (bijection) on the set of PE addresses [SIE75, SIE77a]. When
interconnection function f is applied, PE(i), if active, passes its data
to PE(f(i)) for all i, 0 ≤ i < N, simultaneously. To pass data from one
PE  to  another  PE,  a  programmed  sequence  of  interconnection  functions 
must  be  executed.  An  equivalent  definition  is  that  the  interconnection 
network  takes  the  set  of  PE  addresses  as  its  input  and  produces  as  its 
output  a  permutation  of  these  PE  addresses,  i.e.,  it  transforms  (or 
maps) input address I to output address O. For example, suppose that
PE(i) wishes to send data to PE(i+1). The resulting permutation is f(i)
= (i + 1) modulo N, where i is the address of the PE at the input of the
network, and f(i) is the address of the PE that receives data at the
output of the network. These two definitions will be used
interchangeably.
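The first definition can be made concrete with a small sketch in which an interconnection function is an ordinary function on PE addresses. The names `apply_function` and `shift_by_one` are hypothetical, chosen here for illustration only.

```python
def apply_function(f, data, active=None):
    """Apply interconnection function f once.

    data[i] is the item held by PE(i). Each active PE(i) passes its item
    to PE(f(i)). The result lists what every PE receives (None if nothing).
    """
    N = len(data)
    if active is None:
        active = range(N)          # by default all PEs are active
    received = [None] * N
    for i in active:               # conceptually simultaneous
        received[f(i)] = data[i]
    return received


def shift_by_one(N):
    """Return f(i) = (i + 1) modulo N as a Python function."""
    return lambda i: (i + 1) % N
```

For N = 8, applying `shift_by_one(8)` moves the item of PE(7) to PE(0) and every other item one position up.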

Four  interconnection  networks  are  of  particular  interest  here. 

The  Cube  network  consists  of  the  n  functions  defined  by 

$\mathrm{Cube}_i(p_{n-1} \ldots p_1 p_0) = p_{n-1} \ldots p_{i+1} \bar{p}_i p_{i-1} \ldots p_0$,

for 0 ≤ i < n [SIE75, SIE77a].

The  Cube  interconnection  functions  can  be  interpreted 
geometrically.  Let  the  PE  addresses  represent  the  vertices  of  an  n 
cube. For n = 3, the eight vertices of Figure II.5 are the eight
PEs with addresses 000 through 111. Let the address at a vertex be p =
$p_2 p_1 p_0$. The Cube network has the effect of connecting each vertex to
its  n  neighbors,  that  is,  those  PEs  whose  binary  addresses  differ  in 
only one bit position. In Figure II.5, horizontal lines connect vertices
whose labels differ in bit $p_0$, diagonal lines connect vertices whose
labels differ in bit $p_1$, and vertical lines connect vertices whose
labels differ in bit $p_2$. For example, $\mathrm{Cube}_0$ connects the following
pairs of processors: 000 and 001, 010 and 011, 100 and 101, and 110 and
111. That is, $\mathrm{Cube}_0(0) = 1$, $\mathrm{Cube}_0(1) = 0$, $\mathrm{Cube}_0(2) = 3$, $\mathrm{Cube}_0(3) = 2$,
$\mathrm{Cube}_0(4) = 5$, $\mathrm{Cube}_0(5) = 4$, $\mathrm{Cube}_0(6) = 7$, and $\mathrm{Cube}_0(7) = 6$.
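Because a Cube function complements a single address bit, it can be sketched as an exclusive-or with a one-bit constant. The function names below are illustrative and not taken from the report.

```python
def cube(i, p):
    """Cube_i applied to PE address p: complement bit i of p."""
    return p ^ (1 << i)


def cube_neighbors(p, n):
    """The n PEs whose addresses differ from p in exactly one bit."""
    return [cube(i, p) for i in range(n)]
```

Each Cube function is its own inverse, so applying `cube(i, ...)` twice returns the original address.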

Various types of cube networks have been explored. The multistage
network used in the STARAN is a hardware series of cube functions
[BA76]. The SW banyan with S = F = 2 TOLT'^, 60K76T is a cube type
network. The delta networks proposed in [PAT79] include the Cube
topology. The interconnection network for the CHOPP multiprocessor
system [SUB77] employs the cube interconnection functions. In [BA76,
BAU74, PEA77, SIE79b, SIE78b], the usefulness of this type of network is
shown.

The Shuffle-Exchange network [ST71] consists of two functions. The
Shuffle function is defined as

$\mathrm{Shuffle}(p_{n-1} p_{n-2} \ldots p_1 p_0) = p_{n-2} p_{n-3} \ldots p_1 p_0 p_{n-1}$.

The Exchange function is defined as

$\mathrm{Exchange}(p_{n-1} p_{n-2} \ldots p_1 p_0) = p_{n-1} p_{n-2} \ldots p_1 \bar{p}_0$.

The Shuffle function is analogous to shuffling a deck of cards, as
shown in Figure II.6a for N = 8. The top and bottom cards of the deck
remain stationary, i.e., Shuffle(0) = 0 and Shuffle(7) = 7. The
remaining cards are intermixed, one from the first half of the deck
followed by one from the second half of the deck. Figure II.6b
illustrates the Exchange function for N = 8. Without the Exchange
function, all permutations of input addresses to output addresses which
the Shuffle-Exchange network could form would require that PEs 0 and N-1
be mapped to themselves. Note that Exchange(P) = $\mathrm{Cube}_0$(P).
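The two functions can be sketched as bit operations: the Shuffle is a one-position left rotation of the n-bit address, and the Exchange complements the low-order bit. The function names are illustrative assumptions.

```python
def shuffle(p, n):
    """Shuffle on an n-bit address: rotate the bits left by one position."""
    N = 1 << n
    return ((p << 1) | (p >> (n - 1))) & (N - 1)


def exchange(p):
    """Exchange: complement the low-order address bit."""
    return p ^ 1
```

Applying `shuffle` n times returns every address to itself, since n rotations of an n-bit word restore the word.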

This network is the basis of Lawrie's omega network [LAW75]. It is
also included in the networks of the RAP [CGY74] and Omen [HIG72]
systems. It has been shown to be useful in [GOL61, LAN76, LAST76,
SIE79b, SIE78b, ST71].

The Plus-Minus $2^i$ (PM2I) network consists of the 2n functions
defined by

$\mathrm{PM2}_{+i}(j) = (j + 2^i)$ modulo N

$\mathrm{PM2}_{-i}(j) = (j - 2^i)$ modulo N

for 0 ≤ j < N, 0 ≤ i < n [SIE75, SIE77a]. Throughout this discussion, if
$(j - 2^i) < 0$, then the convention will be that $(j - 2^i)$ modulo N =
$(N + j - 2^i)$ modulo N. For example, (0 - 2) modulo 8 = (8 + 0 - 2)
modulo 8 = 6 modulo 8. Note that $\mathrm{PM2}_{+(n-1)} = \mathrm{PM2}_{-(n-1)}$. A PM2I
interconnection function has the effect of adding or subtracting 1 in
the i-th bit position. Figure II.7 shows the interconnections for
N = 8.
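A minimal sketch of the PM2I functions follows. The single function `pm2` with a `sign` argument is a packaging choice made here for brevity. Python's `%` operator already returns a non-negative result for a positive modulus, which matches the stated convention for negative values.

```python
def pm2(sign, i, j, n):
    """PM2 function on PE address j in a system of N = 2**n PEs.

    sign = +1 gives (j + 2**i) modulo N; sign = -1 gives (j - 2**i) modulo N.
    """
    N = 1 << n
    return (j + sign * (1 << i)) % N
```

For i = n-1 the plus and minus functions coincide, because adding and subtracting N/2 differ by a multiple of N.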

Feng's multistage data manipulator [FE74] is a hardware series of
PM2I functions. The augmented data manipulator [SIS78] is a multistage
PM2I network with a very general control structure. The usefulness of
the PM2I network is discussed in [FE74, SIE79b, SIE78b, SIS78].

The Illiac network is the network used on the Illiac IV computer
[BAR68]. The PEs are configured as a $\sqrt{N} \times \sqrt{N}$ array, and the
interconnection network has the effect of connecting each PE to its
north, south, east, and west neighbors, as shown in Figure II.8. The
four interconnection functions are

$\mathrm{Illiac}_{+1}(i) = (i + 1)$ modulo N (east)
$\mathrm{Illiac}_{-1}(i) = (i - 1)$ modulo N (west)
$\mathrm{Illiac}_{+m}(i) = (i + m)$ modulo N (south)
$\mathrm{Illiac}_{-m}(i) = (i - m)$ modulo N (north)

where $N = 2^n$, and $m = \sqrt{N}$ is an integer [SIE75, SIE77a]. The Illiac
interconnection functions are a subset of the PM2I functions, where
$\mathrm{Illiac}_{+1}(i) = \mathrm{PM2}_{+0}(i)$, $\mathrm{Illiac}_{-1}(i) = \mathrm{PM2}_{-0}(i)$,
$\mathrm{Illiac}_{+m}(i) = \mathrm{PM2}_{+n/2}(i)$, and $\mathrm{Illiac}_{-m}(i) = \mathrm{PM2}_{-n/2}(i)$.

Figure II.8: For N = 16, the Illiac network connects PE(i) to PE(i+1
modulo 16), to PE(i-1 modulo 16), to PE(i+4 modulo 16), and to PE(i-4
modulo 16).
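The relationship between the Illiac and PM2I functions can be checked numerically. The sketch below assumes N is an even power of two, so that the square root of N is an integer; the function and key names are hypothetical.

```python
import math


def pm2(sign, i, j, N):
    """(j + sign * 2**i) modulo N, with sign = +1 or -1."""
    return (j + sign * (1 << i)) % N


def illiac_neighbors(i, N):
    """East, west, south, and north neighbors of PE(i) in the Illiac network."""
    m = math.isqrt(N)
    if m * m != N:
        raise ValueError("N must be a perfect square")
    return {"east": (i + 1) % N, "west": (i - 1) % N,
            "south": (i + m) % N, "north": (i - m) % N}


def illiac_as_pm2(i, N):
    """The same four neighbors expressed through PM2 functions."""
    n = N.bit_length() - 1          # n = log2 N for N a power of two
    half = n // 2                   # 2**(n/2) = sqrt(N) when n is even
    return {"east": pm2(+1, 0, i, N), "west": pm2(-1, 0, i, N),
            "south": pm2(+1, half, i, N), "north": pm2(-1, half, i, N)}
```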


II.3. Network Structures


A  network  can  be  constructed  as  either  a  recirculating  or  a 
multistage  network.  A  recirculating  network  is  an  interconnection 
network  with  a  single  stage  of  switches.  The  stage  is  reused  until  data 
reach  their  final  destinations.  A  complete  data  transfer  may  take 
several passes through the network. Figure II.9 illustrates this
arrangement. A multistage network is an interconnection network
composed of several, usually $\log_2 N$, stages of combinational logic
switches. In general, a single pass through a multistage network is
sufficient to route data to their destinations. However, when a single
pass is insufficient, multiple passes may be used. Figure II.10
illustrates a multistage network.

For  constructing  multistage  networks  such  as  the  STARAN  and  omega 
networks, the interchange box is a useful building block [SIS78]. The
interchange box is a two-input two-output device that, in the most
general case, may assume one of four legitimate states (Figure II.11).
Let the upper input and output lines be labeled i and the lower input
and output lines be labeled j. The four legitimate states are: (1)
straight - input i to output i, input j to output j; (2) exchange -
input i to output j, input j to output i; (3) lower broadcast - input j
to outputs i and j; (4) upper broadcast - input i to outputs i and j
[LAW75]. A two function interchange box is defined to be an interchange
box  capable  of  either  the  straight  or  exchange  states.  A  four  function 
interchange  box  is  defined  to  be  an  interchange  box  capable  of  being  in 
any  of  the  four  legitimate  states. 
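The four legitimate states can be summarized in a small behavioral model. The state names used as strings below are labels chosen for this sketch, not identifiers from the report.

```python
def interchange_box(state, upper, lower):
    """Return (upper_output, lower_output) of an interchange box.

    upper and lower are the data on input lines i and j respectively.
    """
    if state == "straight":
        return (upper, lower)
    if state == "exchange":
        return (lower, upper)
    if state == "lower_broadcast":
        return (lower, lower)      # input j goes to outputs i and j
    if state == "upper_broadcast":
        return (upper, upper)      # input i goes to outputs i and j
    raise ValueError("unknown state: " + state)


# A two function box allows only the first two states.
TWO_FUNCTION_STATES = ("straight", "exchange")
```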


Figure II.11: The interchange box is a two-input two-output device that
can be in one of four legitimate states: straight, exchange, lower
broadcast, or upper broadcast.


The  control  structure  of  a  network  is  an  important  consideration. 
For multistage networks, three types of controls are discussed in
[SIS78]. Individual stage control allows one control signal for each
stage of the network. Individual box control uses a separate control
signal for each interchange box in the network, using hardware [PEA77]
or software (destination tags) [LAW75]. Partial stage control uses more
than one but fewer than N/2 control signals at any stage of the network.

The typical control mechanism for a recirculating network assumes
that an active PE can both send and receive data, whereas an inactive PE
can only receive data, because an interprocessor data transfer instruction
is executed only by active PEs. Here, a control is introduced that
differs  from  the  usual  SIMD  control,  where  there  is  a  single  instruction 
stream,  and  all  active  PEs  must  execute  the  same  interconnection 
function.  By  providing  each  PE  with  its  own  routing  control  register, 
this  restriction  is  removed.  Independent  function  control  allows  each 
PE  to  execute  any  set  of  the  implemented  interconnection  functions.  For 
example, using a Cube network, PE(P) might send data to all PEs whose
addresses differ in one bit from P's address by executing all $\log_2 N$
Cube functions, $\mathrm{Cube}_0$, $\mathrm{Cube}_1$, etc. Also, different PEs may use
different functions, e.g., PE(0) may use $\mathrm{Cube}_0$ while PE(1) uses $\mathrm{Cube}_1$.


II.4. PE Address Masks

In  the  normal  execution  of  an  SIMD  program,  all  PEs  will  respond  to 
the  instructions  issued  by  the  control  unit.  A  masking  scheme  can  be 
provided  which  allows  the  user  to  select  a  subset  of  PEs  to  respond  to 
the  instructions.  To  select  which  PEs  are  active,  an  n  position  PE 
address mask may accompany an instruction [SIE75, SIE77a]. Recall the
address specification notation of section II.1. Each position of the
mask can be 0, 1, or X (don't care). For a given mask, the PEs whose
addresses match the mask are active. For example, if N = 8 and the mask
specified is

MASK [01X]

then  the  active  PEs  are  010  and  011,  and  only  these  two  respond  to  the 
instruction  which  follows  the  MASK  command. 
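Mask matching can be sketched by comparing the mask with the binary form of each address, position by position. Masks are written here as Python strings such as "01X", with repetition factors expanded by hand; the function names are illustrative.

```python
def matches(mask, address, n):
    """True if the n-bit PE address matches the mask of '0', '1', 'X'."""
    if len(mask) != n:
        raise ValueError("mask must have n positions")
    bits = format(address, "0{}b".format(n))   # high-order bit first
    return all(m in ("X", b) for m, b in zip(mask, bits))


def active_pes(mask, n):
    """Addresses of all PEs activated by the mask."""
    return [p for p in range(1 << n) if matches(mask, p, n)]
```

With n = 3, the mask "XX1" selects the odd numbered PEs and "XX0" the even numbered ones.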

A negative PE address mask is the same as a regular PE address
mask, except that it activates all those processors which do not match
the mask [SM78]. This type of mask can activate sets of processors that
a single regular mask cannot. A negative PE address mask is prefixed by
a "-". Superscripts are used as repetition factors when describing
masks. For example, for $N = 2^n$, the command

MASK [-$0^n$]

activates all PEs except PE(0).

Logical  operations  can  be  applied  to  two  PE  address  masks  to 
specify  another  set  of  PEs.  The  logical  OR  of  two  masks  forms  the  union 
of  the  two  sets  of  PEs  specified  by  the  masks.  The  logical  AND  of  two 
masks forms the intersection of the two sets of PEs. For example, the
command

MASK [$X^{n-1}0$] OR [$X^{n-2}01$]

activates all PEs with even addresses and all PEs whose addresses end in
01.
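Negative masks and the logical combination of masks map directly onto set operations. The sketch below represents each mask by the set of addresses it selects. The combined example assumes the second mask selects the addresses ending in 01, and all names are hypothetical.

```python
def mask_set(mask, n):
    """Set of n-bit PE addresses matching a mask of '0', '1', and 'X'."""
    if len(mask) != n:
        raise ValueError("mask must have n positions")
    return {p for p in range(1 << n)
            if all(m in ("X", b)
                   for m, b in zip(mask, format(p, "0{}b".format(n))))}


def negative_mask_set(mask, n):
    """Set of PE addresses that do NOT match the mask."""
    return set(range(1 << n)) - mask_set(mask, n)


# For n = 3: all PEs except PE(0), then "even OR ending in 01".
all_but_zero = negative_mask_set("000", 3)
even_or_01 = mask_set("XX0", 3) | mask_set("X01", 3)   # OR is set union
even_and_low = mask_set("XX0", 3) & mask_set("0XX", 3)  # AND is intersection
```

The set {1, ..., N-1} produced by the negative mask cannot be written as a single regular mask, which illustrates why the negative form is useful.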

Other  masking  schemes  may  be  used.  The  general  address  masks  of 
the Illiac IV computer use a bit vector of length N [BAR68]. PE(i) is
active if and only if the i-th bit of the vector is one. Data
conditional masks reflect "if-then-else" statements [SM78]. When a
conditional statement is encountered in a program, each PE executes the
statement for different data, and so the outcome may be different from
one PE to the next. Consequently, each PE sets an internal flag so that
it  will  be  active  for  either  the  "then"  or  the  "else"  but  not  both.  So, 
each  PE  is  conditionally  active  based  on  the  results  of  a  comparison. 

In describing the simulation algorithms of Chapter VIII, an Algol-like
language will be used. It includes statements that indicate which
PEs are to be active during execution of an algorithm. "For all PEs"
means that all N PEs are to execute the code which follows. "If A then
B" statements first cause A to be evaluated. Only PEs for which A is
true are active for the execution of B; all others are inactive. "If A
then B else C" statements cause A to be evaluated, disable PEs for which
A is false, execute B, disable PEs where A is true and enable PEs for
which A is false, and then execute C [SM78].
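The "if A then B else C" semantics can be imitated serially with a per-PE flag. In this sketch, which is an assumption-laden illustration rather than the report's language, `cond`, `then_op`, and `else_op` are ordinary Python functions applied to the datum held by each PE.

```python
def simd_if_then_else(data, cond, then_op, else_op):
    """Simulate an SIMD "if A then B else C" over one datum per PE."""
    N = len(data)
    # Every PE evaluates A on its own data and sets an internal flag.
    flags = [cond(x) for x in data]
    result = list(data)

    # "then" phase: PEs whose flag is false are disabled.
    for i in range(N):
        if flags[i]:
            result[i] = then_op(data[i])

    # "else" phase: the active set is complemented.
    for i in range(N):
        if not flags[i]:
            result[i] = else_op(data[i])
    return result
```

Both phases are issued to all PEs, so the statement costs the time of B plus the time of C even though each PE executes only one of them.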




III.  LITERATURE  REVIEW 


The traditional N X N crossbar switch is too expensive for use in a
large SIMD computer. Other networks have been proposed that can produce
all permutations of PE addresses in one pass through the network.

One such network is the rearrangeable switching network [BE65],
which uses $O(N \log_2 N)$ gates, but requires $O(N \log_2 N)$ time to set up
the network [OPT71]. Feierbach and Stevenson [FS77] have investigated
this network for use in an SIMD computer with 1024 PEs. Algorithms are
presented which implement a k-shift (PE(j) sends data to PE(k+j)), the
perfect shuffle, and broadcasting (PE(j) sends data to all other PEs).
Figure III.1 shows this network built using two function interchange
boxes for N = 8.


Batcher's sorting network [BA68, KN73] could be used as an
interconnection network. It requires $O(N (\log_2 N)^2)$ gates, and
requires $O((\log_2 N)^2)$ time to pass data through the network. This
network is shown for N = 8 in Figure III.2 [SIE78b]. The building block
for this network compares its two inputs and orders them accordingly at
the outputs.
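The behavior of such a sorting network can be sketched with the usual iterative formulation of bitonic sorting, in which every inner pass corresponds to one column of compare-exchange elements. This is a generic textbook formulation, not the specific circuit of Figure III.2, and it assumes the number of items is a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.

    Returns (sorted_list, stages), where stages counts the columns of
    compare-exchange elements that the data pass through.
    """
    a = list(values)
    N = len(a)
    stages = 0
    k = 2
    while k <= N:                  # size of the bitonic sequences merged
        j = k // 2
        while j >= 1:              # distance between compared lines
            for i in range(N):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            stages += 1
            j //= 2
        k *= 2
    return a, stages
```

For N = 8 the data pass through 1 + 2 + 3 = 6 columns, in line with the $(\log_2 N)^2$ order of the delay.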

Both recirculating and multistage Shuffle-Exchange networks have
been examined in the literature.


Figure III.2: A bitonic sorter for an arbitrary sequence of elements
[SIE78b]. Panel labels: four 2-element bitonic sequence sorters, two
$2^2$-element bitonic sequence sorters, one $2^3$-element bitonic sequence
sorter.


Stone [ST71] has published algorithms which show how the perfect
shuffle  can  be  used.  Algorithms  for  polynomial  evaluation,  sorting 
using  Batcher's  bitonic  sorting  algorithm,  calculating  the  FFT  using 
Pease's  algorithm,  and  matrix  transposition  are  presented. 

Lawrie's omega network [LAW73, LAW75] is an expanded multistage
Shuffle-Exchange network. The omega network is an n stage network, where
each stage is a Shuffle followed by a four function interchange box as
shown in Figure III.3. A control procedure for the omega network was
presented in [LAW75] where destination tags for each datum determine its
path through the network.
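Destination tag routing can be sketched for a single datum. The sketch assumes that the tag is the binary destination address, that it is examined from the high-order bit at the first stage, and that a 0 bit selects the upper box output while a 1 bit selects the lower. Conflicts between simultaneous transfers are not modeled, and the function name is hypothetical.

```python
def omega_route(source, dest, n):
    """Trace one datum through an n stage omega network.

    Returns the list of line addresses occupied at the network input and
    after each stage. The last entry equals dest.
    """
    N = 1 << n
    position = source
    path = [position]
    for stage in range(n):
        # Shuffle connection in front of the interchange boxes.
        position = ((position << 1) | (position >> (n - 1))) & (N - 1)
        # The box output is chosen by the next destination tag bit.
        tag_bit = (dest >> (n - 1 - stage)) & 1
        position = (position & ~1) | tag_bit
        path.append(position)
    return path
```

Each stage replaces one address bit with a tag bit, so after n stages the line address equals the destination regardless of the source.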

Wen [WEN76] also has presented results for the omega network.
Various types of control methods are investigated such as passing
destination tags and using read only memory to store control
information. Also presented are methods of broadcasting one datum to a
power-of-two number of other PEs and partitioning the omega network into
groups containing a power-of-two number of the $2^n$ PEs. Parallel
algorithms for linear recurrence relations and matrix multiplications
are presented and analyzed.

Lang [LAN76] has studied the Shuffle-Exchange network. He has
presented a modification of a recirculating network which, by adding
queues at the input to the network, can realize any permutation in at
worst $O(\sqrt{N})$ time. A simplified Shuffle-Exchange network has been
presented [LAST76]. This network needs less logic to control the
network than the omega network. While it cannot perform all omega
network permutations, it can form many useful ones. A method of
partitioning this network has been given which assumes that all
partitions perform the same interconnection function.




Several Cube networks have been presented in the literature. The
CHOPP machine [SUB77], a multiple instruction stream - multiple data
stream (MIMD) machine [FL66] design, uses a recirculating cube network
to move packets of information among processors. The flip network of
the STARAN [BA76] SIMD machine is a multistage Cube network which moves
data between processors and memory or between processors and processors.
Two sets of controls are provided for the flip network as shown in Figure
III.4. The flip controls are individual stage control for the network.
The shift controls are partial stage control and allow the network to
perform all shifts of $2^i$ modulo $2^j$, 0 < i, j < n. That is, it can
perform the permutations of PE addresses where PE(k) sends data to
PE(k + $2^i$ modulo $2^j$) for all k, 0 ≤ k < N. Pease [PEA77] has also worked with a
multistage Cube network, the indirect binary n-cube (Figure III.5), for
use in systems with large numbers of processors. He has shown how such
a network could be used for spectral analysis algorithms and matrix
operations. In [SIS78], a generalized cube network (Figure III.6), a
multistage Cube network, was introduced and used as a basis for
comparing multistage Cube networks.

Feng's data manipulator [FE74] is based on the PM2I functions. The
data manipulator network (Figure III.7) consists of n stages of N cells,
where for 0 ≤ j < N and 0 ≤ i < n, there are three sets of
interconnections from input cell j at stage i: $\mathrm{PM2}_{-i}$, $\mathrm{PM2}_{+i}$, and
straight to output cell j. Each stage of the network is controlled by a
pair of signals selected from a group of six. $U_1$ ($\mathrm{PM2}_{-i}$), $D_1$ ($\mathrm{PM2}_{+i}$),
and $H_1$ (straight) control cells whose i-th bit of the address is 0, and
$U_2$ ($\mathrm{PM2}_{-i}$), $D_2$ ($\mathrm{PM2}_{+i}$), and $H_2$ (straight) control those cells whose i-th
bit is 1.

Figure III.4: The STARAN flip network for N = 8 [BA76]. (a) The flip
control is individual stage control. (b) The shift control is partial
stage control.

Figure III.5: The indirect binary n-cube is a $\log_2 N$ stage cube
network, shown here for N = 8. The first stage forms $\mathrm{Cube}_0$, the second
$\mathrm{Cube}_1$, and the last $\mathrm{Cube}_2$ [PEA77].

Figure III.6: The generalized cube network, shown here for N = 8, is a
$\log_2 N$ stage cube network. The first stage, stage 2, forms $\mathrm{Cube}_2$, the
next forms $\mathrm{Cube}_1$, and the last forms $\mathrm{Cube}_0$.

The augmented data manipulator [SIS78, SIE79a] is a data
manipulator with individual cell control. That is, each cell receives
none, one, or two of the signals H, U, and D. Since each cell is
controlled independently, the set of permutations that the network can
perform is a superset of those of the generalized cube network [SIS78].

Siegel has presented comparisons of the Shuffle-Exchange, Cube,
PM2I, and Illiac networks [SIE77b]. Lower and upper time bounds have
been presented for each network to simulate any other [SIE77a, SIE79b].
The effects of PE address masks have been related to the number of
permutations on the set of PE addresses that a network can perform
[SIE77a]. Algorithms have been presented which show how a network can
simulate an arbitrary interconnection with the aid of Batcher's bitonic
sorting algorithm by sorting destination tags associated with the data
presented to the network [SIE78b]. In [SIS78, SIE79a], it was shown
that some multistage networks have the same topologies and so can
perform the same permutations. This fact will be used extensively in
proving theorems in Chapters V and VIII.




IV.  NETWORK  STRUCTURES 

IV.1. Introduction

Three types of interconnection functions, the Cube, the Shuffle-Exchange,
and the Plus-Minus $2^i$, are implemented as recirculating
(single  stage)  networks  and  as  multistage  combinational  logic  networks 
in  section  2.  Comparisons  are  made  on  hardware  complexity  and  delay  to 
transfer  data.  The  Shuffle-No  Shuffle-Exchange  network  is  introduced  in 
section  3.  This  network  has  all  the  capabilities  of  the  multistage 
Shuffle-Exchange  network,  but  in  addition,  can  perform  1  to  n-1  shuffles 
in  one  pass  through  the  network.  Section  4  considers  breaking  a  data 
word  into  segments  before  passing  the  datum  through  the  network.  Then, 
the  width  of  the  network  is  smaller  than  when  the  wider  data  word  is 
passed  all  at  once.  So,  more  than  one  pass  is  made  through  the  network 
to  pass  the  data.  Alternatively,  for  a  multistage  network,  the  network 
can  be  pipelined  and  the  S  segments  passed  in  parallel.  The  cost  and 
delay  to  pass  S  segments  are  compared  for  these  cases. 




IV.2. Hardware Implementations

In  order  to  compare  these  networks,  typical  circuits  for  each  are 
presented.  Comparisons  of  gate  count  and  circuit  delay  are  made  for 
multistage  and  recirculating  Shuffle-Exchange,  Cube,  and  PM2I  networks. 
For  simplicity,  the  following  analysis  is  made  on  networks  that  are  one 
bit  wide.  The  costs  of  DTRin,  DTRout,  and  any  hardware  needed  to 
interface  with  the  network  are  not  included  in  the  network  cost 
estimates.  The  delay  times  presented  are  intended  as  approximations  and 
are  useful  for  comparisons.  In  practice,  the  speed  of  the  network  will 
also  depend  on  the  speed  at  which  control  signals  can  be  generated  and 
on  the  technology  of  the  circuit  design.  Table  IV. 1  summarizes  the 
notation  that  will  be  used  in  the  discussion. 

Three  multistage  Cube  networks  which  have  been  presented  in  the 
literature are the STARAN flip network [BA76], the indirect binary
n-cube [PEA77], and the generalized cube [SIS78]. In [SIS78], it was
shown  that  these  three  networks  and  an  n-stage  Shuffle-Exchange  network 
are  all  topologically  equivalent. 

With this in mind, a circuit for an 8-item generalized cube network
is presented in Figure III.6. At the input and output to each stage,
each line has an n bit binary address, P, 0 ≤ P < N. Stage i compares
input lines I and I', whose addresses differ in only bit i, and
conditionally exchanges data between I and I'. In this way, a $\mathrm{Cube}_i$
function is implemented. The interchange box (Figure IV.1), on which
this Cube network can be based, conditionally interchanges the data at
its inputs, thus performing a conditional exchange. So, stage i of the
network forms $\mathrm{Cube}_i$, 0 ≤ i < n. The circuit uses n*N/2 interchange
boxes, or 7n*N/2 gates, and has a delay of dms*n + 2*dr, where dms is
the delay through the interchange box, and 2*dr represents the delay
through DTRin and DTRout.

Table IV.1: Definitions and abbreviations

NOTATION  MEANING

dr        delay of a register
cr        cost of a register
dm        delay of a multiplexer
cm        cost of a multiplexer
dms       delay of the logic for one stage of a multistage network
cms       cost of the logic for one stage of a multistage network
drn       delay of the logic for a recirculating network
drs       delay of the logic for a recirculating Shuffle-Exchange network
cbr       cost of the logic (buffers) for a recirculating network
Cr        cost of a recirculating network
Cp        cost of a pipelined multistage network
Cm        cost of a combinational logic multistage network
Tr        time delay of a recirculating network
Tm        time delay of a combinational logic multistage network
Tp        time delay of an n-stage pipelined multistage network
Tk        time delay of a k-stage pipelined multistage network
W         width of a data word to be transmitted through the network
S         number of segments into which a data word is divided
Q         the number of passes made through a recirculating network to
          complete a desired transfer
N         the number of PEs in the system
n         log2 N

As a design example, suppose it is desired to build a multistage
generalized cube network for N = 1024. This 10 stage network could be
designed using off-the-shelf components. An inverting interchange box
made from AND-OR-INVERT gates (SN7451) can be used in place of the NAND
gates in Figure IV.1. This circuit (Figure IV.2) performs the same
function as the interchange box, but the outputs are complemented.
Since interchange boxes are cascaded to form a multistage network, if n
is even, these inverting outputs cancel. For example, after stage n-1,
the data are complemented, but after stage n-2, the data are true. A
design for a 10 stage 1 bit wide generalized cube network would require
512*10 = 5120 dual 2-wide 2-input AND-OR-INVERT chips (SN7451) and
⌈5120/6⌉ = 854 hex inverter chips (SN7404). This is a total of 5974
integrated circuit packages, or less than 6 integrated circuit packages
per bit per PE.
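
The package count above can be checked with a short calculation. The following is a minimal sketch, assuming one SN7451 package per interchange box and one inverter per box (which is what the quoted counts 512*10 and 5120/6 imply); the function name and parameters are hypothetical and are not part of the report.

```python
import math

def cube_package_count(N, inverters_per_package=6):
    """Package count for a 1 bit wide multistage generalized cube network.

    Assumes (as the quoted figures suggest) one SN7451 package per
    interchange box and one inverter per box, six inverters per SN7404.
    """
    n = N.bit_length() - 1              # n = log2 N, for N a power of two
    boxes = (N // 2) * n                # N/2 interchange boxes per stage
    aoi_packages = boxes
    inverter_packages = math.ceil(boxes / inverters_per_package)
    return aoi_packages, inverter_packages, aoi_packages + inverter_packages

aoi, inv, total = cube_package_count(1024)
print(aoi, inv, total, round(total / 1024, 2))
```

For N = 1024 this gives 5120 and 854 packages, 5974 in total, which is about 5.83 packages per bit per PE.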

Alternatively, an interchange box can be constructed from two
2-line-to-1-line multiplexers. Let the output of multiplexer zero be
D0_out and let the output of multiplexer one be D1_out. If exchange = 0,
then multiplexer zero selects D0_in and multiplexer one selects D1_in.
If exchange = 1, then multiplexer zero selects D1_in and multiplexer one
selects D0_in. One integrated circuit (an SN74157) may contain four such
multiplexers which all respond to the same control signals. So, a 2-bit
wide generalized cube network for N = 1024 can be built using 512*10 =
5120 quadruple 2-line-to-1-line multiplexers (SN74157), or less than 3
integrated circuit packages per bit per PE.
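
The two-multiplexer interchange box can be modeled behaviorally in a few lines. This is only an illustrative sketch of the selection rule described above; the names mux2 and interchange_box are hypothetical, and no timing or gate-level detail of the SN74157 is modeled.

```python
def mux2(select, a, b):
    """2-line-to-1-line multiplexer: output a when select is 0, b when 1."""
    return b if select else a

def interchange_box(d0_in, d1_in, exchange):
    """Interchange box built from two multiplexers sharing one control."""
    d0_out = mux2(exchange, d0_in, d1_in)   # multiplexer zero
    d1_out = mux2(exchange, d1_in, d0_in)   # multiplexer one
    return d0_out, d1_out

print(interchange_box(0, 1, exchange=0))    # straight
print(interchange_box(0, 1, exchange=1))    # exchanged
```

With exchange = 0 the inputs pass straight through, and with exchange = 1 they are swapped.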

A multistage PM2I network can be built from the modules of Figure
IV.3 [FE74]. Referring to the multistage network for N=8 in Figure
III.7, a logic module of stage 2 has the construction of the left-hand
side of the module of Figure IV.3, while the receiving right-hand side
of the module lies in stage 1 of Figure III.7. The number of gates
needed for an n stage PM2I network is 4*N*n, and the delay is
dms*n + 2*dr.

A 10 stage 1 bit wide PM2I network could be designed from discrete
components using 3*1024*10/4 = 7680 2-input NAND packages (SN7400) and
⌈1024*10/3⌉ = 3414 3-input NAND packages (SN7410). This is 11,094
integrated circuit packages, or less than 11 chips per bit per PE.

Figure  IV. 4  shows  a  circuit  for  a  recirculating  Shuffle-Exchange 
network.  At  any  pass  through  the  network,  either  a  shuffle  or  an 
exchange  may  take  place.  Referring  to  Figure  II. 9,  to  accomplish 
multiple  passes  through  the  network,  a  recirculating  network  allows  a 
multiplexer  to  select  data  from  DTRin  or  DTRout  for  input  to  the 
network.  Q  passes  through  the  network  may  be  required  to  complete  a 
desired  transfer  of  data  between  PEs.  This  circuit  uses  3N  gates  and 
has  a  delay  of  dr  +  Q(  dr  +  dm  +  drs  ). 

The recirculating Cube network can be built using tri-state buffers
with outputs that can be OR-tied together, as shown in Figure IV.5. A
recirculating PM2I network could be constructed similarly (Figure IV.6).
The Cube network uses N*n buffers while the PM2I network uses 2*N*(n-1)
buffers, since PM2_{+(n-1)} = PM2_{-(n-1)}. Both have delay
dr + Q(dr + dm + drn). For small n, the Cube network could be built
using 2 levels of NAND gates, where the input to the network is composed
of 2-input NANDs and the receiving side is composed of n-input NANDs.
For large n, due to fan-in and integrated circuit package count
limitations, the tri-state buffer design is preferable.


Figure IV.4: A circuit for a recirculating Shuffle-Exchange network.

Figure IV.5: A circuit for a recirculating Cube network for PE(i).


When selecting an interconnection network, the time to complete a
data transfer is an important consideration. For a recirculating
network, this time is proportional to the number of passes made through
the network, Q, 0 ≤ Q ≤ n. In the case of the multistage network, for
any data transfer, all n stages must be traversed, so as the number of
PEs grows, the time to pass data increases.

For some value of Q, a multistage network and a recirculating
network will have the same delay time. If dr = 6 ns, dm = 5 ns, drn =
4.5 ns, and dms = 6 ns, then the two networks will require the same
delay time if Q = .39 (1 + n). This indicates that the choice between a
recirculating network and a multistage network should depend on the
number of interconnection functions used on the average to complete a
data transfer, and thus depend on the types of problems that a system is
designed to perform. For example, if skewed storage [ST75] is used,
there will be uniform shifts which may require all n Cube functions of a
multistage network. However, sorting using the Cube functions will need
only one interconnection function at a time [SIE73b]. Figure IV.7
illustrates this effect for an n-stage generalized cube and a
recirculating Cube network.
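
The break-even value of Q follows from setting the two delay expressions given earlier equal, dms*n + 2*dr = dr + Q(dr + dm + drn). The sketch below evaluates this under the delay values quoted above; the function names are hypothetical.

```python
def multistage_delay(n, dms, dr):
    """Delay of an n-stage combinational multistage network (ns)."""
    return dms * n + 2 * dr

def recirculating_delay(Q, dr, dm, drn):
    """Delay of Q passes through a recirculating network (ns)."""
    return dr + Q * (dr + dm + drn)

def break_even_passes(n, dr=6.0, dm=5.0, drn=4.5, dms=6.0):
    """Q at which both networks have the same delay."""
    return (dms * n + dr) / (dr + dm + drn)

for n in (4, 7, 10):
    print(n, round(break_even_passes(n), 2), round(0.39 * (1 + n), 2))
```

With these values the ratio is 6(n + 1)/15.5, or about .39 (1 + n). A transfer needing fewer passes than this favors the recirculating network, and one needing more favors the multistage network.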

In  Chapter  II,  three  types  of  controls  for  multistage  networks  were 
defined:  individual  stage  control,  individual  box  control,  and  partial 
stage  control.  The  multistage  network  implementations  discussed  here 
can  be  used  with  any  of  these  control  schemes.  The  recirculating 




Figure IV.7: Break even points for a multistage Cube network versus a
recirculating Cube network.




IV.3. The Shuffle-No Shuffle-Exchange Network

For the Cube and PM2I networks, any data transfer that can be
accomplished in one pass through a recirculating network can be done in
one pass through a multistage network. This is not true for the
multistage Shuffle-Exchange network. Recall from Chapter II that the
shuffle permutation maps network input P = p_{n-1}...p_1 p_0 to network
output Shuffle(P) = p_{n-2}...p_1 p_0 p_{n-1}.
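
The shuffle is a one-position left rotation of the n-bit address. A minimal sketch follows; the function name and the choice of integers as addresses are assumptions made for illustration.

```python
def shuffle(P, n):
    """Map p_{n-1}...p_1 p_0 to p_{n-2}...p_1 p_0 p_{n-1} (left rotate)."""
    msb = (P >> (n - 1)) & 1                 # p_{n-1}
    return ((P << 1) & ((1 << n) - 1)) | msb

# Shuffle of the 8 addresses of an N = 8 (n = 3) network.
print([shuffle(P, 3) for P in range(8)])
```

For n = 3, addresses 0 through 7 map to 0, 2, 4, 6, 1, 3, 5, 7.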

From Theorem 2 of [LAW75], it can be shown that the multistage Shuffle-
Exchange network cannot perform the shuffle permutation in a single pass
through the network. Theorem 2 states that for all mappings of source PE
Si to destination PE Di that define a permutation of input PEs to output
PEs, a log2 N stage Shuffle-Exchange network can produce this mapping if
and only if

    Si ≠ Sj =>
        ( Si modulo 2^k ≠ Sj modulo 2^k
          OR ⌊Di/2^k⌋ modulo 2^(n-k) ≠ ⌊Dj/2^k⌋ modulo 2^(n-k) )

for all k, 1 ≤ k ≤ n, and for all i, j, 0 ≤ i, j < N, where ⌊A⌋ is the
greatest integer that is less than or equal to A. Consider two source
destination pairs that occur in specifying the shuffle permutation:

    (Si, Di) = (0 p_{n-2}...p_1 p_0,  p_{n-2}...p_1 p_0 0)

    (Sj, Dj) = (1 p_{n-2}...p_1 p_0,  p_{n-2}...p_1 p_0 1).

Then, Si ≠ Sj, but Si modulo 2 = Sj modulo 2 and ⌊Di/2⌋ modulo 2^(n-1)
= ⌊Dj/2⌋ modulo 2^(n-1). For k = 1, the two source destination pairs do
not satisfy the criteria of Theorem 2 of [LAW75], and so the multistage
Shuffle-Exchange network cannot pass the shuffle permutation.
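
The criterion can be tested mechanically. The sketch below assumes the condition in the form "for every k, 1 ≤ k ≤ n, no two distinct sources agree in both Si modulo 2^k and ⌊Di/2^k⌋"; since Di < 2^n, the reduction modulo 2^(n-k) is automatic. The function names are hypothetical, and this is an illustration of the test rather than code from [LAW75].

```python
def shuffle(P, n):
    """Left-rotate the n-bit address P by one position."""
    return ((P << 1) & ((1 << n) - 1)) | ((P >> (n - 1)) & 1)

def passes_shuffle_exchange(dest, n):
    """Check the pairwise criterion for a permutation dest[S] = D."""
    for k in range(1, n + 1):
        seen = set()
        for S, D in enumerate(dest):
            key = (S % (1 << k), D >> k)     # low k bits of S, high n-k bits of D
            if key in seen:                  # two sources collide for this k
                return False
            seen.add(key)
    return True

n = 3
identity = list(range(1 << n))
shuffled = [shuffle(S, n) for S in range(1 << n)]
print(passes_shuffle_exchange(identity, n), passes_shuffle_exchange(shuffled, n))
```

The identity permutation satisfies the criterion, while the shuffle permutation fails it already at k = 1, in agreement with the argument above.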

To rectify this, a Shuffle-No Shuffle-Exchange (SNSE) network is
introduced here. At each stage of this multistage network (Figure
IV.8), the options are straight (do nothing), Shuffle, Exchange, or
Shuffle-Exchange. Data passes through stage n-1, ..., 1, and last passes
through stage 0. Alternatively, the cell i of the network can be
designed as a multiplexer which conditionally outputs the data input to
cell i, data from Shuffle^{-1}(i), data from Exchange(i), or data from
Shuffle^{-1}(Exchange(i)). Assume that at any stage, a shuffle affects
all PEs, and that any exchange signal affects both PE(i) and
PE(Exchange(i)). Unlike Lawrie's omega network, no broadcast functions
will be allowed. This network can produce all the permutations of the
recirculating Shuffle-Exchange network. Thus, it has the advantage over
the multistage Shuffle-Exchange network of being able to perform one to
n-1 shuffles in one pass through the network.
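
One SNSE stage can be modeled as an optional shuffle of all lines followed by optional exchanges on selected pairs. The following is a behavioral sketch only; the function snse_stage, its arguments, and the numbering of exchange boxes (box j joins lines 2j and 2j+1) are assumptions made for illustration.

```python
def shuffle(i, n):
    """Left-rotate the n-bit address i by one position."""
    return ((i << 1) & ((1 << n) - 1)) | ((i >> (n - 1)) & 1)

def snse_stage(data, n, do_shuffle, exchange_boxes=()):
    """Apply one stage: straight, Shuffle, Exchange, or Shuffle-Exchange."""
    N = 1 << n
    out = list(data)
    if do_shuffle:                       # a shuffle affects all lines
        out = [None] * N
        for i in range(N):
            out[shuffle(i, n)] = data[i]
    for box in exchange_boxes:           # an exchange affects both lines of a box
        a, b = 2 * box, 2 * box + 1
        out[a], out[b] = out[b], out[a]
    return out

n = 3
data = list(range(1 << n))
# Stage n-1 shuffles; the remaining stages are set straight.
result = snse_stage(data, n, do_shuffle=True)
for _ in range(n - 1):
    result = snse_stage(result, n, do_shuffle=False)
print(result)
```

Here a single pass delivers the item from line i to line Shuffle(i), which the n-stage Shuffle-Exchange network cannot do in one pass.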

Let P = { (Si, Di) | 0 ≤ i < N } be a permutation mapping of source
PE address Si to destination PE address Di, where the binary
representation of Si is p_{n-1}...p_1 p_0 and that of Di is
d_{n-1}...d_1 d_0. A network passes a permutation P if and only if P
conforms to the acceptable form for that network, and no conflicts
result in the passage through the network [LAW75]. Z ↑ P means the
network Z passes a permutation P.


Figure IV.8: A Shuffle-No Shuffle-Exchange circuit for row i,
0 ≤ i < N.


Theorem IV.1: An n-stage SNSE network can form the following
permutations. Consider two acceptable source-destination tag pairs,
(Si, Di) and (Sj, Dj). Then for all i and j, 0 ≤ i, j < N, the
acceptable source destination pairs have the following form. For all t
such that 0 ≤ t < n,

1) (A) ⌊Si/2⌋ modulo 2^(n-(t+1)) = ⌊Di/2^(t+1)⌋ modulo 2^(n-(t+1))

   (B) AND (Si ≠ Sj =>
           ∧_{k=0}^{t} ( ⌊Si/2⌋ modulo 2^(n-k-1) ≠ ⌊Sj/2⌋ modulo 2^(n-k-1)
                OR ⌊Di/2^(t-k)⌋ modulo 2^(k+1) ≠ ⌊Dj/2^(t-k)⌋ modulo 2^(k+1) ))

Or, the pairs are of the form,

2) Si ≠ Sj => (Si modulo 2^k ≠ Sj modulo 2^k
        OR ⌊Di/2^k⌋ modulo 2^(n-k) ≠ ⌊Dj/2^k⌋ modulo 2^(n-k))

for 1 ≤ k ≤ n, where

    ∧_{k=0}^{t} (A(k))

is the logical AND of all A(k) for 0 ≤ k ≤ t.

Proof: An n-stage SNSE network, designated Z, will be analyzed by parts.
Z_0 will refer to a network constructed with only an exchange function,
i.e., Z_0 = E. Z_1 prefixes an Exchange-Shuffle stage to Z_0, i.e., Z_1 =
(ES)Z_0 = ESE. Z_2 prefixes an Exchange-Shuffle to Z_1, i.e., Z_2 =
(ES)Z_1 = (ES)(ES)Z_0 = (ES)(ES)E. For 0 < t < n, Z_t = (ES)Z_{t-1} =
(ES)^t E. The last prefix creates Z_n, an n stage Shuffle-Exchange
network, by prefixing a shuffle stage to Z_{n-1}, i.e., Z_n =
S(ES)^{n-1}E = (SE)^n. The permutations that Z can pass are the union of
the permutations that Z_0, Z_1, ..., and Z_n can pass.

Note  that  certain  cases  are  not  explicitly  considered  here.  For 
example,  a  shuffle  followed  by  k  No  Shuffles  is  equivalent  to  k  No 
Shuffles  followed  by  a  shuffle.  Also,  a  shuffle  followed  by  two 
exchanges  has  the  same  effect  as  just  a  shuffle.  The  first  cases  are 
not  considered  in  this  argument.  The  second  equivalent  ones  are. 
Without  loss  of  generality,  it  is  assumed  that  if  stage  i  is  a  Shuffle, 
then  so  is  stage  j,  for  all  j,  i  >  j  >1 

Z_0 accepts the permutations for which ⌊Si/2⌋ = ⌊Di/2⌋.
A conflict, C0, between two different sources Si and Sj results if and
only if Di = Dj, that is

    C0 = (⌊Si/2⌋ = ⌊Sj/2⌋) AND (Di modulo 2 = Dj modulo 2).

So,

    Z_0 ↑ P <=> (⌊Si/2⌋ = ⌊Di/2⌋) AND
                (Si ≠ Sj => NOT(C0)), 0 ≤ i,j < N.

Recall that Z_t = (ES)^t E. For 0 < t < n, the transition of data is

    p_{n-1}...p_1 p_0
    --> p_{n-2}...p_1 d_t d_{t-1}
    ...
    --> p_{n-t-1}...p_1 d_t d_{t-1}...d_0.

The acceptable permutations are, thus,

    ⌊Si/2⌋ modulo 2^(n-(t+1)) = ⌊Di/2^(t+1)⌋ modulo 2^(n-(t+1)).

For Si ≠ Sj, a conflict results after k shuffles and k+1 possible
exchanges, 0 ≤ k ≤ t, if and only if

    Ct(k) = (⌊Si/2⌋ modulo 2^(n-k-1) = ⌊Sj/2⌋ modulo 2^(n-k-1))
            AND (⌊Di/2^(t-k)⌋ modulo 2^(k+1) = ⌊Dj/2^(t-k)⌋ modulo 2^(k+1)).

Thus, the expression for a conflict, Ct, is

    Ct = ∨_{k=0}^{t} (Ct(k)).

Therefore,

    Z_t ↑ P <=>
        (⌊Si/2⌋ modulo 2^(n-(t+1)) = ⌊Di/2^(t+1)⌋ modulo 2^(n-(t+1)))
        AND (Si ≠ Sj => NOT(Ct)), 0 ≤ i,j < N.

The combination of the permutations which Z_0 through Z_{n-1} pass is
equivalent to the above statements 1) (A) and (B) in the statement of
Theorem IV.1.

Z_n is a shuffle stage concatenated with Z_{n-1} and is an n-stage
Shuffle-Exchange network. From [LAW75], it is seen that Z_n accepts a
permutation P if and only if

    Si ≠ Sj =>
        NOT( Si modulo 2^k = Sj modulo 2^k
             AND ⌊Di/2^k⌋ modulo 2^(n-k) = ⌊Dj/2^k⌋ modulo 2^(n-k) ),

    1 ≤ k ≤ n.

The expression is equivalent to statement 2) in the statement above of
Theorem IV.1.


The n-stage network Z passes all permutations that are passed by
Z_0 through Z_n.


The SNSE network can form all the permutations of the recirculating
Shuffle-Exchange. If x passes through the recirculating network are
needed, then ⌈x/n⌉ passes through the SNSE network accomplish the same
data transfer. This speed-up has been accomplished at the cost of a few
extra gates per PE, 5*N*n for the SNSE network versus 7/2*N*n for the
multistage Shuffle-Exchange, and of more control signals, n*N/2 for the
multistage Shuffle-Exchange and (1 (Shuffle) + N/2 (Exchange) +
N/2 (Shuffle-Exchange)) * n signals for the SNSE network.


IV.4. Combinational Logic Multistage Networks
and Pipelined Multistage Networks

In  section  2,  networks  were  analyzed  as  if  they  were  one  bit  wide. 
Data  words  may  be  sent  through  the  network  bit  serially,  but  other 
methods  may  be  more  efficient. 

Let the width of a data word be W bits. A network may be designed
as W planes, where each plane is a one bit wide network [LAW75]. As an
example, consider the 10-stage generalized cube network for N = 1024
that was described in section 2. The number of packages required for W
= 32 is 5974 * 32 = 191,168 integrated circuit packages, or about 187
per PE.

The amount of hardware could be reduced by a compromise. The data
word could be divided into W/B = S segments of data, the network
constructed as a B bit wide network, and then the data word passed in S
passes through the network. In this manner, if the delay of the network
is D, then the delay to pass the entire data word through the network is
S*D. If the number of gates in a W wide network is G, the number of
gates in the reduced network is G*B/W = G/S.

In order to show how this division of the network might be done,
three sample hardware designs are presented. Design 1 moves data into
and retrieves data from the network B bits at a time under software
control. This method is slow, but requires no extra hardware for
control. Design 2 multiplexes the S segments into and out of the
network. The W bit data word is loaded into DTRin. An S-to-1
multiplexer supplies the network with each segment of data at the proper
time. An S-to-1 demultiplexer retrieves the segments from the network
and arranges them in DTRout. Since the maximum delay of the network is
much less than one instruction cycle, this design is faster than design
1. But, it requires more hardware for the W bit wide DTRin and DTRout
registers and the B bit wide multiplexer and demultiplexer. Design 3
constructs DTRin and DTRout from B S-bit shift registers. DTRin is made
from parallel-in-serial-out registers, and DTRout is made from
serial-in-parallel-out registers. Let d_{W-1}...d_1 d_0 be a data word.
The first register of DTRin stores bits d_{S-1}...d_1 d_0. The second
register stores bits d_{2S-1}...d_{S+1} d_S. The B-th register stores
bits d_{W-1}...d_{W-S+1} d_{W-S}. Each clock period, the least
significant bit of each of the B shift registers of DTRin is presented
to the network, the DTRin registers are shifted, and the next B bits are
ready to be presented. During the clock period, these B bits propagate
through the combinational logic of the network. Each clock period, at
the output of the network, each of the B shift registers of DTRout
receives one bit from the network. After S clock periods, the S bits in
each of the B shift registers are presented as a W bit word to the PE.
DTRin and DTRout will be treated as W-bit registers by the balance of
the system. This design is faster than design 1 and requires less
hardware than design 2.
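
The bit layout of design 3 can be illustrated with a small simulation. This is a sketch under stated assumptions: the data word is a Python list with word[i] = d_i, the network is replaced by a straight-through connection, and the helper names load_dtrin, clock_out and load_dtrout are hypothetical.

```python
def load_dtrin(word, B, S):
    """Split a W = B*S bit word into B registers; register j holds
    d_{jS} ... d_{(j+1)S-1}, least significant bit first."""
    assert len(word) == B * S
    return [word[j * S:(j + 1) * S] for j in range(B)]

def clock_out(registers):
    """Each clock period, present the least significant bit of every
    register as one B-bit segment, then shift the registers."""
    regs = [list(r) for r in registers]
    for _ in range(len(regs[0])):
        yield [r.pop(0) for r in regs]

def load_dtrout(segments, B):
    """Shift each arriving bit into its serial-in register and return
    the reassembled word after all S clock periods."""
    regs = [[] for _ in range(B)]
    for segment in segments:
        for j, bit in enumerate(segment):
            regs[j].append(bit)
    return [bit for r in regs for bit in r]

word = [1, 0, 1, 1, 0, 0, 1, 0]            # d_0 ... d_7, so W = 8
registers = load_dtrin(word, B=2, S=4)
segments = list(clock_out(registers))       # network assumed straight-through
restored = load_dtrout(segments, B=2)
print(registers, segments, restored)
```

With B = 2 and S = 4, the first segment presented is (d_0, d_4), and after S = 4 clock periods the word is reassembled in DTRout.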

For an extra cost, overlap parallelism may be added to a multistage
network to reduce the total time to move S segments of data. Assume that
the network is B bits wide, that DTRin and DTRout are W bits wide, and
... for interfacing between the DTRs and the network are small
... cost of the network. Let the network be an n stage
... registers of delay dr between each stage, each stage
... Figure IV.9 illustrates this arrangement.




The  delay  for  the  combinational  logic  multistage  network  is  the 
time  to  load  DTRin,  plus  the  time  to  pass  through  the  network,  plus  the 
time  to  load  DTRout .  The  cost,  n*cms*B,  considers  only  the  network  and 
not  DTRin  and  DTRout.  The  delay  of  the  pipelined  network  is  dr  +  n  * 
(dms  +  dr)  to  get  the  first  segment  from  the  network,  and  the  remaining 
segments  arrive  at  DTRout  in  the  next  S-1  time  delays.  The  cost  of  the 
pipelined  network  is  that  of  the  multistage  network  plus  cr*(n-1)*B  for 
the  N-bit  registers  that  are  placed  between  each  of  the  n  stages. 

The S segments of data may be transferred using a pipelined network
in time

    Tp = dr + n(dr + dms) + (S - 1)(dr + dms)

       = dr + (dr + dms)(n + S - 1).

The unpipelined network transfers the same data in time

    Tm = S(dms * n + 2 * dr).

If the last segment is loaded into DTRout as the next segment is loaded
into DTRin, then

    Tm = dr + S(dms * n + dr).

Either formula can be used for the analysis, and the difference in the
analysis is negligible.
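
The three expressions can be compared numerically. The sketch below uses the Schottky-based estimates dr = 9 ns and dms = 10 ns that the report adopts for its plots; the function names are hypothetical.

```python
def t_pipelined(n, S, dr, dms):
    """Tp = dr + (dr + dms)(n + S - 1)."""
    return dr + (dr + dms) * (n + S - 1)

def t_combinational(n, S, dr, dms):
    """Tm = S(dms*n + 2*dr)."""
    return S * (dms * n + 2 * dr)

def t_combinational_overlapped(n, S, dr, dms):
    """Tm = dr + S(dms*n + dr), loading DTRout and DTRin together."""
    return dr + S * (dms * n + dr)

for S in (1, 2, 4, 8):
    print(S, t_pipelined(10, S, 9, 10), t_combinational(10, S, 9, 10),
          t_combinational_overlapped(10, S, 9, 10))
```

For n = 10 the pipelined network is slower for a single segment (199 ns versus 118 ns) but faster from two segments on (218 ns versus 236 ns at S = 2).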

Figures IV.10, IV.11, and IV.12 plot Tp vs Tm for various values of
N and S, for dr = 9 ns and dms = 10 ns. These time approximations are
based on Schottky logic according to [TI76]. In all three Figures, for
S > 2, the pipelined multistage network passes data in less time than
the combinational logic multistage network. As S grows, this time
difference becomes more pronounced.




IV.5. Equal Cost Combinational Logic and
Pipelined Multistage Networks

Suppose that a combinational logic multistage network and a
pipelined multistage network have equal cost. Since the pipelined
network uses more hardware than the combinational logic multistage
network, the width of the pipelined network is less than the width of
the combinational logic multistage network of equal cost. The
relationship between S_m, the number of data segments for the multistage
network, and S_p, the number of data segments for the pipelined network,
is

    (n * cms + (n-1) * cr) * W/S_p = n * cms * W/S_m

or

    S_p = S_m * (n * cms + (n-1) * cr)/(n * cms).

Suppose cms = cr, that is, the cost of the logic for one stage of a
multistage network is about that of the cost of the register at the
output of that stage. If the number of gates for a simple flip-flop and
the two levels of NAND gates that can comprise one stage of the network
(or a multiplexer) are compared, then cms = cr is a fair estimate of
relative cost. If the package count is considered, since D-type flip
flops are available four (SN74175) or six (SN74174) to a package,
cms > cr may be true. For this analysis, cms = cr, and it is understood
that this is a worst case estimate for pipelined network costs.




If cms = cr = C, then

    (n * C + (n-1) * C)/S_p = n * C/S_m, or

    (n + (n-1)) * C * S_m = n * C * S_p, or

    S_p = (2*n - 1) S_m / n.

Table IV.2 lists S_m, S_p, Tp, and Tm for various values of N for
combinational and pipelined networks of equal cost, assuming dms = 10 ns
and dr = 9 ns. Figures IV.13, IV.14, and IV.15 illustrate the data
transfer delay time variations for two equal cost networks. In order to
move W bits through a network, for a small number of segments of data,
the unpipelined network completes the data transfer in less time than
the pipelined multistage network. However, as the number of segments
grows, the pipelined multistage network requires less time to complete
the transfer than the same cost combinational logic multistage network.
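
The equal cost relation and its effect on delay can be sketched as follows. The rounding of S_p up to the next integer is an assumption made here for illustration (it makes the pipelined network slightly cheaper than the combinational one), and the function names are hypothetical.

```python
import math

def equal_cost_segments(n, S_m, cms=1.0, cr=1.0):
    """S_p = S_m * (n*cms + (n-1)*cr) / (n*cms)."""
    return S_m * (n * cms + (n - 1) * cr) / (n * cms)

def compare(n, S_m, dr=9, dms=10):
    """Return (S_p rounded up, Tp, Tm) for two networks of about equal cost."""
    S_p = math.ceil(equal_cost_segments(n, S_m))
    Tp = dr + (dr + dms) * (n + S_p - 1)
    Tm = S_m * (dms * n + 2 * dr)
    return S_p, Tp, Tm

for S_m in (1, 2, 4, 8):
    print(S_m, compare(10, S_m))
```

For n = 10, a single-segment combinational network (Tm = 118 ns) beats its equal cost pipelined counterpart (S_p = 2, Tp = 218 ns), while at S_m = 4 the pipelined network (S_p = 8, Tp = 332 ns) is faster than the combinational one (Tm = 472 ns).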


Figure IV.13: Cost vs delay for equal cost pipelined and combinational
logic networks for N = 16.

Figure IV.14: Cost vs delay for equal cost pipelined and combinational
logic networks for N = 128. (Axis label: cost for two equal cost
networks, in units of cms*W.)



IV.6. Average Data Transfer Times

Consider the average time to pass one segment of data. Let both a
combinational logic multistage network and a pipelined multistage
network have the same width. The combinational logic multistage network
passes, on the average, one segment of data in (dms*n + 2*dr) time
units. The pipelined network uses an average of

    (dr + (n+S-1)(dms+dr)) / S

time units/segment. As S increases, this average time decreases. These
two average times are equal for

    dms*n + 2*dr = (dr + (n+S-1)(dms+dr)) / S

    S = (dr + (n+S-1)(dms+dr)) / (dms*n + 2*dr)

    S - S*(dms+dr)/(dms*n + 2*dr) = (dr + (n-1)(dms+dr)) / (dms*n + 2*dr)

    S = (dr + (n-1)(dms+dr)) / (dms*(n-1) + dr).

For example, if dms = dr and n = 10, then

    S = (dr + (10-1)(2*dms)) / (dms*(10-1+1)) = 19/10.

So, for S ≥ 2, the average delay per segment is less for the pipelined
network than for the combinational logic multistage network, although
the total time to pass one data item may be greater, due to the time to
fill and empty the pipe. This suggests that a pipelined multistage
network is most applicable where many segments of data are passed, such
as passing blocks of data at once rather than one data word.
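
The crossover between S = 1 and S = 2 can be seen by tabulating the two averages. The sketch below takes dms = dr = 1 time unit and n = 10, as in the example; the function names are hypothetical.

```python
def avg_combinational(n, dr, dms):
    """Average time per segment, combinational network."""
    return dms * n + 2 * dr

def avg_pipelined(n, S, dr, dms):
    """Average time per segment, n-stage pipelined network."""
    return (dr + (n + S - 1) * (dms + dr)) / S

n, dr, dms = 10, 1, 1
for S in (1, 2, 4, 16):
    print(S, avg_pipelined(n, S, dr, dms), avg_combinational(n, dr, dms))
```

With these values the combinational average is 12 units per segment, while the pipelined average falls from 21 units at S = 1 to 11.5 units at S = 2 and keeps decreasing toward dms + dr = 2 as S grows.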




IV.7. k-stage Pipelined Multistage Networks


An n stage pipeline may not be best for a particular application.
The network can be divided into a k-stage pipeline, 0 < k ≤ n, where n/k
is an integer. In this case, k-1 extra registers are added, and n/k
stages of the network form one stage of the pipeline. The analysis
above can be extended to a k-stage pipeline.

The  time  to  pass  S  segments  of  data  for  the  k-stage  pipeline  is 

Tk  =  dr  +  k  (  n/k  *  dms  +  dr)  +  (S  -  1)(  n/k  *  dms  +  dr) 

=dr+  (k+S-1)(  n/k  *  dms  +  dr). 


The  cost  of  this  network  is 


Ck  =  n  *  cms  *  B  +  (k-1)  *  cr  *  B. 


If S is fixed, the value of k for which the delay is minimum is of
interest. By taking the derivative with respect to k of Tk, and setting
it to zero, this minimum can be found.

    dTk/dk = dr - (n/k^2) * dms * (S-1) = 0

or, k = √(n * dms * (S-1)/dr).

If n = 10 and dms = dr, then the minimum Tk occurs for k =
√(10(S-1)). Table IV.3 shows Tk and k for various values of S. As S
increases, the value of k which yields the minimum Tk also increases.
The minimum value of Tk increases with k and S, but for a fixed S, the
value of k in the table yields the minimum Tk for any pipelined network.
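
The optimum can be evaluated directly. The sketch below computes the real-valued k from the expression above and then, as an assumption about how an implementable value would be chosen, picks the divisor of n nearest to it; the function names are hypothetical, and dms = dr = 1 time unit.

```python
import math

def t_k(n, k, S, dr, dms):
    """Tk = dr + (k + S - 1)(n/k * dms + dr)."""
    return dr + (k + S - 1) * (n / k * dms + dr)

def optimal_k(n, S, dr, dms):
    """Real-valued k that minimizes Tk for fixed S."""
    return math.sqrt(n * dms * (S - 1) / dr)

def nearest_divisor(n, k_real):
    """Nearest k to k_real for which n/k is an integer."""
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    return min(divisors, key=lambda d: abs(d - k_real))

for S in (2, 4, 8):
    k_real = optimal_k(10, S, 1, 1)
    k_int = nearest_divisor(10, k_real)
    print(S, round(k_real, 2), k_int, t_k(10, k_int, S, 1, 1))
```

For n = 10 this selects k = 2, 5, and 10 for S = 2, 4, and 8, with Tk of 19, 25, and 35 time units respectively.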


Table IV.3: Minimum value of the number of stages in a pipelined
multistage network, given S, the number of segments of data to be
passed, and the corresponding value of Tk.

S    k      Tk        nearest integer value of k    Tk for that k
                      with n/k an integer

2    3.16   17*dms    2                             19*dms
4    5.5    24*dms    5                             25*dms
8    8.36   33*dms    10                            35*dms



Suppose that a combinational logic multistage network and a k-stage
pipelined multistage network have equal cost. Since a one bit wide
pipelined network uses more hardware than a one bit wide combinational
logic network, the width of the pipelined network is less than the width
of the combinational logic multistage network. The relationship between
S_m, the number of data segments for the multistage network, and S_k,
the number of data segments for the k-stage pipelined network, is

    (n * cms + (k-1) * cr) * W/S_k = n * cms * W/S_m.

If cms = cr, then

    S_k = S_m * (n + k - 1)/n

for networks of equal cost. Table IV.4 compares S_k and S_m for various
n and k.

The average time for the k-stage pipeline to pass one segment of
data is

    (dr + (dms * n/k + dr)(k+S-1)) / S    time units/segment.

This average time is equal to the average time for the combinational
logic multistage network to pass a segment when

    S = (dr + (dms*n/k + dr)(k-1)) / (dms*(n - n/k) + dr).

If dms = dr and n = 10, then

    S = (k^2 + 10k - 10) / (11k - 10).

For k = 10, S = 1.9, which is the result of section 6. For k = 5, S =
1.44, and for k = 2, S = 1.17.
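
The break-even number of segments for a k-stage pipeline can be evaluated from the general expression and checked against the n = 10, dms = dr special case. This is a minimal sketch; the function name is hypothetical.

```python
def break_even_segments(n, k, dr, dms):
    """S at which the k-stage pipeline and the combinational network
    have the same average time per segment."""
    return (dr + (dms * n / k + dr) * (k - 1)) / (dms * (n - n / k) + dr)

for k in (10, 5, 2):
    general = break_even_segments(10, k, 1.0, 1.0)
    special = (k * k + 10 * k - 10) / (11 * k - 10)   # dms = dr, n = 10
    print(k, round(general, 2), round(special, 2))
```

Both forms give 1.9 for k = 10, about 1.44 for k = 5, and about 1.17 for k = 2.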

Consider a combinational logic multistage network and a k-stage
pipelined multistage network that have equal cost. Let D = dms = dr.
Then, since for equal cost networks, S_k = S_m (n+k-1)/n,

    Tm = S_m (n * dms + 2*dr) = S_m * (n+2) * D, and

    Tk = dr + (k + S_m * (n + k - 1)/n - 1) * (n/k + 1) * D.

Tk > Tm if and only if

    S_m (n+2) < 1 + (k + S_m * (n+k-1)/n - 1)(n/k + 1),

or

    S_m < (n^2 k - n^2 + n k^2) / (n^2 k - n^2 + n - k^2 + k).

This relationship implies that, for some values of S_m, a pipelined
network can be designed whose cost is about that of the combinational
logic network, but whose delay to pass one datum (divided into S_k
segments) is less than that of the unpipelined network to pass the datum
(divided into S_m segments). To illustrate this, consider n=10, S_m=1,
W=16. Then, for the combinational logic multistage network, B = 16, Cm
= 160 C, and Tm = 12 D, where D = dms = dr and C = cms = cr. For k=10,
for an equal cost pipelined network, S_k = 1.9. If we choose S_k = 2,
B = 8, then Ck = 152 C and Tk = dr + (11)(dms + dr) = 23 D. Here, then,
Tk > Tm for two equal cost networks. Alternatively, if S_m = 4 and
B = 4, then Cm = 40 C, and Tm = 4(10+2) D = 48 D. For k=10, S_k =
4(10 + 10 - 1)/10 = 7.6. If we choose S_k = 8, then Ck = 2(10 + 9)C =
38 C and Tk = D(1 + (10+8-1)(2)) = 35 D. So, Tk < Tm for these two
equal cost networks. Using the relation defined above, if S_m > 2.3,
then a pipelined network is a more cost effective design, since for
about the same cost, the delay is less than that for an unpipelined
multistage network.
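
The threshold and the two numerical cases can be reproduced with a short script. The sketch below works in units of C = cms = cr and D = dms = dr and assumes S_m and S_k divide W; the function names are hypothetical.

```python
def threshold_segments(n, k):
    """Value of S_m below which Tk > Tm for equal cost networks."""
    return (n * n * k - n * n + n * k * k) / (n * n * k - n * n + n - k * k + k)

def equal_cost_case(n, k, W, S_m, S_k):
    """Return (Cm, Ck, Tm, Tk) in units of C and D."""
    Cm = n * (W // S_m)                    # n * cms * B
    Ck = (n + k - 1) * (W // S_k)          # (n*cms + (k-1)*cr) * B
    Tm = S_m * (n + 2)                     # S_m (n*dms + 2*dr)
    Tk = 1 + (k + S_k - 1) * (n / k + 1)   # dr + (k+S_k-1)(n/k*dms + dr)
    return Cm, Ck, Tm, Tk

print(round(threshold_segments(10, 10), 2))
print(equal_cost_case(10, 10, 16, S_m=1, S_k=2))
print(equal_cost_case(10, 10, 16, S_m=4, S_k=8))
```

For n = k = 10 the threshold is about 2.3. The first case gives costs of 160 C and 152 C with delays of 12 D and 23 D, and the second gives 40 C and 38 C with delays of 48 D and 35 D.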

Consider the following "figures of merit" for evaluating multistage
networks. For a combinational logic multistage network,

    Tm*Cm/(D*C) = (Wt * S_m * (n+2) * D) * (W/S_m * n * C) / (D*C)

                = Wt * W * n * (n+2),

where Wt is the number of words per data transfer, W is the width in
bits of a data word, D = dms = dr, and C = cms = cr.
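
The combinational figure of merit can be computed both from the delay and cost factors and from the simplified product, which shows that S_m cancels. This is a minimal sketch in units of D and C; the function names are hypothetical.

```python
def merit_from_factors(Wt, W, n, S_m):
    """(Tm/D) * (Cm/C) with Tm = Wt*S_m*(n+2)*D and Cm = (W/S_m)*n*C."""
    Tm = Wt * S_m * (n + 2)
    Cm = (W // S_m) * n          # assumes S_m divides W
    return Tm * Cm

def merit_combinational(Wt, W, n):
    """Simplified form Wt * W * n * (n+2)."""
    return Wt * W * n * (n + 2)

for Wt in (1, 8, 16):
    print(Wt, merit_combinational(Wt, 16, 10), merit_from_factors(Wt, 16, 10, S_m=4))
```

For n = 10 and W = 16 this gives 1920, 15,360, and 30,720 for Wt = 1, 8, and 16, regardless of the S_m chosen.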

For a k-stage pipelined multistage network,

    Tk*Ck/(D*C) = D(1 + (k + S_k*Wt - 1)(n/k + 1)) * W/S_k * (n+k-1) * C / (D*C)

or,

    Tk*Ck/(D*C) =
        Wt*W(n/k + 1)(n + k - 1) + W/S_k (n + k - 1)(n + n/k + k).

In general, for a given Wt, the smaller the figure of merit, the
better the performance for a given cost. Note that Tm * Cm/(D * C) is
not a function of S_m, since Tm is proportional to S_m and Cm is
proportional to 1/S_m. As S_m increases, the delay time Tm increases,
but the cost Cm decreases. As Wt grows, the magnitude of this delay-cost
product reflects the unsuitability of a combinational logic multistage
network for transferring large blocks of data. Note also that Tk *
Ck/(D * C) is a function of S_k, but is not as sensitive to changes in
Wt as is Tm * Cm/(D * C). As Wt and S_k increase, the delay-cost product
for the pipelined multistage network changes to reflect the usefulness
of the network for moving large blocks of data.

As an example, let n=10 and W=16. For a combinational logic
multistage network, Table IV.5 shows the figures of merit for various
Wt. For a k-stage pipelined network, the corresponding figures of merit
are shown in Table IV.6 for various Wt, S_k, and k.
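
As a rough aid for reproducing such tables, the sketch below evaluates both delay-cost products in Python. The function names are hypothetical. For the pipelined case it uses the expanded expression Wt*W*(n/k + 1)(n + k - 1) + (W/S_k)(n + k - 1)(n + n/k + k) as printed, which is an assumption about how the tabulated values were obtained.

```python
from fractions import Fraction


def merit_combinational(n, W, Wt):
    """Tm * Cm / (D * C) for a combinational logic multistage network."""
    return Wt * W * n * (n + 2)


def merit_pipelined(n, W, Wt, S_k, k):
    """Tk * Ck / (D * C) for a k-stage pipelined network (expanded form)."""
    q = Fraction(n, k)  # stages of combinational logic per pipeline segment
    first = Wt * W * (q + 1) * (n + k - 1)
    second = Fraction(W, S_k) * (n + k - 1) * (n + q + k)
    return first + second


if __name__ == "__main__":
    n, W = 10, 16
    for Wt in (1, 8, 16):
        print("combinational", Wt, merit_combinational(n, W, Wt))
    for Wt in (1, 8, 16):
        for k in (10, 5):
            for S_k in (1, 8, 16):
                print("pipelined", Wt, S_k, k, merit_pipelined(n, W, Wt, S_k, k))
```

Exact rational arithmetic (`Fraction`) is used only so that n/k and W/S_k do not introduce rounding when they are not integers.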

This  analysis  indicates  that  an  unpipelined  multistage  design  is 
preferable  to  a  pipelined  design  only  if  the  number  of  segments  per  data 
word  and  the  number  of  words  per  transfer  are  small.  For  example,  if 
large  blocks  of  data  are  to  be  transferred  between  PEs,  the  pipelined 
network  is  a  better  choice  for  a  design  than  the  combinational  logic 
multistage  network.  If,  however,  the  width  of  the  data  word  is  the  same 
as  the  width  of  the  network  and  data  transfers  occur  one  word  at  a  time 
at  infrequent  intervals,  then  the  combinational  logic  multistage  network 
is  more  cost  effective  than  the  pipelined  network. 


Table IV.5: Figures of merit for combinational logic multistage
networks for N = 1024 and W = 16.

    Wt  |  Tm * Cm/(D * C)
    ----+-----------------
     1  |    1920
     8  |  15,360
    16  |  30,720


Table IV.6: Figures of merit for pipelined multistage networks
for N = 1024 and W = 16.

    Wt   S_k   k   |  Tk * Ck/(D * C)
    ---------------+-----------------
     1    1   10   |    6992
     1    8   10   |    1406
     1   16   10   |    1007
     1    1    5   |    4480
     1    8    5   |    1148
     1   16    5   |     910
     8    1   10   |  11,284
     8    8   10   |    5662
     8   16   10   |    5263
     8    1    5   |    9184
     8    8    5   |    5852
     8   16    5   |    5614
    16    1   10   |  16,112
    16    8   10   |  10,526
    16   16   10   |  10,127
    16    1    5   |  14,560
    16    8    5   |  11,228
    16   16    5   |  10,990


Custom Integrated Circuit Implementation Considerations

Custom  integrated  circuits  are  attractive  for  building 
interconnection  networks  since  delay  times  and  chip  counts  and,  thus, 
cost  of  the  network,  can  be  reduced.  Another  potential  advantage  of  such 
an  implementation  is  the  reduction  of  wires  between  stages  of  the 
network.  The  connections  that  such  wires  represent  can  become  a  major 
portion  of  the  system  expense.  But,  some  restrictions  must  be  made.  In 
order  to  be  cost  effective,  the  implementation  must  use  one  chip  as  the 
building  block  for  the  network,  and  connect  these  modules  to  form  the 
interconnection  network.  If  the  modules  are  packaged  in  dual  inline 
packages,  then  the  number  of  pins  is  limited  to  about  60.  More  pins, 
and  so  more  flexible  packaging,  can  be  obtained  using  ceramic  chip 
carriers. 

This  section  will  suggest  designs  for  custom  integrated  circuit 
implementation of multistage Cube (Figure III.6) and PM2I (Figure III.7
and Figure IV.3) networks. It assumes that the logic on a chip is
essentially  free  and  that  the  guiding  factors  in  any  design  are  the 
number  of  pins  per  chip  and  the  number  of  wires  between  chips  needed  to 
construct  the  network. 

To  reduce  the  number  of  wires  that  are  not  on  a  building  module, 
several  means  are  available.  Consider  the  generalized  cube  network  shown 
in Figure III.6. For any stage of the network, if more than one
interchange  box  for  that  stage  were  placed  on  the  building  module,  some 
wire  length  could  be  saved.  More  wires  can  be  eliminated  if  two  stages 
are  merged  onto  one  module.  That  is,  given  a  set  of  interchange  boxes 
in  stages  i  and  i-1,  if  the  two  stages  are  merged  onto  one  chip,  the 




wires  out  of  stage  i  and  into  stage  i-1  can  be  eliminated. 

For  a  generalized  cube  network,  if  each  building  module  is  a 

complete  generalized  cube  network  of  size  2’,  then  the  number  of  wires 
between  chips  can  be  minimized.  One  possibility  is  the  generalized  cube 
for 2^2 = 4 PEs shown in Figure IV.16. This module requires four control
signals, 8W pins for input and output of data, where W is the width in
bits of the data word to be transferred, and two pins for power/ground.
A  further  savings  can  be  realized  by  noting  that  the  module  is 

combinational  logic,  and  so  the  boolean  functions  it  realizes  can  be 
implemented  by  two  levels  of  combinational  logic.  Thus,  the  data 
transfer  delay  through  the  module  can  be  reduced  from  four  levels  of 
logic  to  two.  The  four  functions  can  be  expressed  as  follows.  Let  the 
four inputs be I0, I1, I2, and I3, and the four outputs be O0, O1, O2,
and O3. For two function interchange boxes, four control signals,
c(i,j), i, j ∈ {0, 1}, are available. If c(i,j) = 1, then an exchange
occurs. If c(i,j) = 0, then the straight state is selected. Then, with
\bar{c}(i,j) denoting the complement of c(i,j),

\[
\begin{aligned}
O_0 &= \bar{c}(0,0)\,\bar{c}(0,1)\,I_0 + c(0,0)\,\bar{c}(0,1)\,I_2
     + \bar{c}(1,0)\,c(0,1)\,I_1 + c(1,0)\,c(0,1)\,I_3 \\
O_1 &= \bar{c}(1,0)\,\bar{c}(0,1)\,I_1 + c(1,0)\,\bar{c}(0,1)\,I_3
     + \bar{c}(0,0)\,c(0,1)\,I_0 + c(0,0)\,c(0,1)\,I_2 \\
O_2 &= \bar{c}(0,0)\,\bar{c}(1,1)\,I_2 + \bar{c}(1,0)\,c(1,1)\,I_3
     + c(0,0)\,\bar{c}(1,1)\,I_0 + c(1,0)\,c(1,1)\,I_1 \\
O_3 &= \bar{c}(1,0)\,\bar{c}(1,1)\,I_3 + \bar{c}(0,0)\,c(1,1)\,I_2
     + c(1,0)\,\bar{c}(1,1)\,I_1 + c(0,0)\,c(1,1)\,I_0.
\end{aligned}
\]
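
One way to sanity check the two-level form is to compare it with a direct cascade of four two function interchange boxes. The Python sketch below does this under an assumed wiring that is inferred from the equations rather than taken from the figure: c(0,0) and c(1,0) control first-stage boxes on line pairs (0,2) and (1,3), and c(0,1) and c(1,1) control second-stage boxes on line pairs (0,1) and (2,3). All function names are hypothetical.

```python
from itertools import product


def two_stage(inputs, c):
    """Cascade of two stages of two function boxes; c[i][j] stands for c(i,j)."""
    i0, i1, i2, i3 = inputs
    # First stage (assumed): box 0 joins lines 0 and 2, box 1 joins lines 1 and 3.
    a0, a2 = (i2, i0) if c[0][0] else (i0, i2)
    a1, a3 = (i3, i1) if c[1][0] else (i1, i3)
    # Second stage (assumed): box 0 joins lines 0 and 1, box 1 joins lines 2 and 3.
    o0, o1 = (a1, a0) if c[0][1] else (a0, a1)
    o2, o3 = (a3, a2) if c[1][1] else (a2, a3)
    return (o0, o1, o2, o3)


def two_level(inputs, c):
    """Sum-of-products form: exactly one product term is enabled per output."""
    i0, i1, i2, i3 = inputs
    (c00, c01), (c10, c11) = c

    def select(*terms):
        chosen = [value for enabled, value in terms if enabled]
        assert len(chosen) == 1  # the four product terms are mutually exclusive
        return chosen[0]

    o0 = select((not c00 and not c01, i0), (c00 and not c01, i2),
                (not c10 and c01, i1), (c10 and c01, i3))
    o1 = select((not c10 and not c01, i1), (c10 and not c01, i3),
                (not c00 and c01, i0), (c00 and c01, i2))
    o2 = select((not c00 and not c11, i2), (not c10 and c11, i3),
                (c00 and not c11, i0), (c10 and c11, i1))
    o3 = select((not c10 and not c11, i3), (not c00 and c11, i2),
                (c10 and not c11, i1), (c00 and c11, i0))
    return (o0, o1, o2, o3)


if __name__ == "__main__":
    for bits in product((0, 1), repeat=4):
        c = ((bits[0], bits[1]), (bits[2], bits[3]))
        assert two_stage("abcd", c) == two_level("abcd", c)
    print("two-level form agrees with the cascade for all 16 control settings")
```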

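For N = 2^n, stages i and i-1 can be merged. The four inputs into
the building module at stage i have addresses

p_{n-1} ... p_{i+1} 0 0 p_{i-2} ... p_1 p_0,
p_{n-1} ... p_{i+1} 0 1 p_{i-2} ... p_1 p_0,
p_{n-1} ... p_{i+1} 1 0 p_{i-2} ... p_1 p_0, and
p_{n-1} ... p_{i+1} 1 1 p_{i-2} ... p_1 p_0.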

There  are  other  possibilities  for  selecting  the  logic  for  a 
building  module  for  an  interconnection  network.  But,  for  the  generalized 
cube  network,  the  module  described  by  the  four  equations  above 
represents  an  implementation  which  is  least  expensive  in  terms  of  the 
number  of  off  chip  wires  required  to  build  the  network. 

Theorem IV.2: Let the module in Figure IV.16 be the chip, C1, used to
implement two stages, i and i-1, of a generalized cube network. Let C2
be another chip composed of four interchange boxes in stages i and i-1,
and C2 ≠ C1. Then, for any C2, an implementation of a generalized cube
using C1 is a less expensive implementation of the network than one
using C2.

Proof: In Figure IV.17, two C1 chips implement the circuit. One
implements interchange boxes 0, 1, 2, and 3, and the other implements
interchange boxes 4, 5, 6, and 7. No off chip wires are required for
connections between interchange boxes in stage i and interchange boxes
in stage i-1. Call these wires interstage wires. Consider the eight
interchange boxes of Figure IV.17. Any chip C2 as defined above spans
four of these boxes in one of three ways.

Case 1: Here, C2 groups four boxes as

( 0, 1, 4, 5 ), and ( 2, 3, 6, 7 ), or
( 0, 1, 6, 7 ), and ( 2, 3, 4, 5 ).

For this case, the number of interstage wires between stage i and i-1 is
N. Therefore, an implementation of the generalized cube using C2 is more
expensive than one using C1, which has no interstage wires between
stages i and i-1.

Case 2: C2 spans the eight boxes by picking one from the set
( 0, 1, 2, 3 ) and three from the set ( 4, 5, 6, 7 ), or vice versa.
Then, the number of interstage wires between stages i and i-1 is four
for each pair of chips C2. So, any implementation using C2 is more
expensive than one using C1.

Case 3: C2 groups interchange boxes ( 1, 3, 4, 6 ), ( 0, 2, 4, 6 ),
( 0, 2, 5, 7 ), or ( 1, 3, 5, 7 ). Because the four interchange boxes do
not form a complete network of size 2^2 = 4, as C1 does, interstage wires
must connect different chips of type C2. The number of interstage wires
is, for N > 4, 2(N/4) = N/2. Therefore, any implementation using C2 is
more expensive than one using C1. □

A multistage PM2I network can be constructed from multiplexers
which select one of several inputs as the output from the circuit. For
this discussion, assume that each switching cell of the network (Figure
III.7) receives control signals independent of any other cell of the
network. At stage i, an output line from the stage, P_out, the
switching cell position with address P at stage i, n > i ≥ 0, 0 ≤ P < N,
selects data from one of three inputs. These are

(1) with the "straight across" control, i.e., PM2_{+i} = 0 = PM2_{-i}, from
P_in, the line which is P_out of stage i+1,

(2) with the PM2_{-i} control, from P + 2^i from stage i+1, or

(3) with the PM2_{+i} control, from P - 2^i from stage i+1.

Such a selection circuit can be represented by the multiplexer of Figure
IV.18. This multiplexer uses 4W gates and two inverters for the
controls, 3W data input lines, W data output lines, two control lines,
and two lines for power and ground.
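
The three-way selection can be modelled in a few lines of Python. This is a behavioural sketch only, not the gate-level circuit: the function name, the list-based representation of a stage, and the rule that asserting both controls is an error are assumptions made for illustration.

```python
def pm2i_stage(prev, i, plus, minus):
    """Compute the outputs of stage i from the outputs of stage i+1.

    prev  -- list of N data items, prev[P] being line P out of stage i+1
    plus  -- plus[P] is truthy if cell P receives the PM2_{+i} control
    minus -- minus[P] is truthy if cell P receives the PM2_{-i} control
    """
    N = len(prev)
    out = []
    for P in range(N):
        if plus[P] and minus[P]:
            raise ValueError("cell %d: PM2+i and PM2-i both asserted" % P)
        if plus[P]:
            src = (P - 2 ** i) % N   # data moved by +2^i arrives from P - 2^i
        elif minus[P]:
            src = (P + 2 ** i) % N   # data moved by -2^i arrives from P + 2^i
        else:
            src = P                  # straight across
        out.append(prev[src])
    return out


if __name__ == "__main__":
    N, i = 8, 1
    data = list(range(N))
    print(pm2i_stage(data, i, [1] * N, [0] * N))  # every item moves by +2
    print(pm2i_stage(data, i, [0] * N, [1] * N))  # every item moves by -2
```

Because each cell has its own control inputs, a source line may be selected by more than one cell, which is how broadcasting arises in this model.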

Multistage PM2I networks can be implemented as custom integrated
circuits by a combination of two methods. One method combines the
circuitry for two lines at stage i, as shown in Figure IV.19. Another
combines the circuitry for line P from stages i and i-1 into one
circuit, as shown in Figure IV.20. In specifying the circuits for a
multistage PM2I network implementation, assume that mappings of data
from input PEs to output PEs are not necessarily permutations, that is,
broadcasting data from one PE to several others is allowed.

One  way  of  decreasing  the  number  of  wires  between  chips  is  to  merge 
two  or  more  multiplexers  at  stage  i  into  one  chip.  Consider  merging  two 
multiplexers,  P  and  P',  into  one  chip  with  input  from  stage  i+1  and 
output  to  stage  i-1. 

Theorem IV.3: Suppose two multiplexers are to be merged into one chip at
stage i. Then, the choices which remove the most off-chip wires and
have the lowest number of pins per chip are

multiplexers P and (P + 2^i) modulo N, and
multiplexers P and (P - 2^i) modulo N.

Proof: Consider the choice of multiplexers P and (P + 2^i) modulo N (the
case of P and (P - 2^i) modulo N is analogous). For this discussion,
assume that all additions and subtractions are modulo N. Both
multiplexers have common inputs from lines P and P + 2^i from stage i+1
that do not need to be replicated. The chip then becomes that of Figure
IV.19. This chip has 8W gates and four inverters for control signals,
4W data input lines, 2W data output lines, four control lines, and two
lines for power and ground. The total number of pins for this chip is
6W + 4 + 2. Any other choice of two multiplexers with no common data
input lines has 8W gates and four inverters for control signals, 6W data
input lines, 2W data output lines, four control lines, and two lines for
power and ground. This gives a pin count of 8W + 4 + 2 pins. □

[Figure IV.18: A circuit for line P at stage i selects data from line P,
line (P + 2^i) modulo N, or line (P - 2^i) modulo N of stage i+1. The
multiplexer selects some input to become the output P_out based on the
set of control signals input to it.]

[Figure IV.19: At stage i of a multistage PM2I network, two
multiplexers, P and P', can be constructed on one integrated circuit
chip.]

[Figure IV.20: The circuits for multiplexer P at stage i and multiplexer
P at stage i-1.]

Consider merging two multiplexers in adjacent stages, the
multiplexer P at stage i and the multiplexer P at stage i-1. Figure
IV.20 shows this arrangement. The data which become the input to stage
i-2 can originate in the following positions at stage i+1. Again,
assume that all additions and subtractions are modulo N arithmetic.

(1) multiplexer P, if stages i and i-1 are set to straight across,

(2) multiplexer P + 2^i, if at stage i the datum is affected by PM2_{-i},
and at stage i-1 by a straight across movement,

(3) multiplexer P - 2^i, if at stage i the datum is affected by PM2_{+i},
and at stage i-1 by a straight across movement,

(4) multiplexer P - 2^(i-1), if the datum is affected by "straight" at
stage i and by PM2_{+(i-1)} at stage i-1,

(5) multiplexer P + 2^(i-1), if the datum is affected by "straight" at
stage i and by PM2_{-(i-1)} at stage i-1,

(6) multiplexer P + 2^i + 2^(i-1), if at stage i the datum is affected by
PM2_{-i} and at stage i-1 by PM2_{-(i-1)}, and

(7) multiplexer P - 2^i - 2^(i-1), if at stage i the datum is affected by
PM2_{+i} and at stage i-1 by PM2_{+(i-1)}.

Figure IV.21 shows the resulting multiplexer for position P. This chip
has 8W gates and four inverters for control signals, 7W input lines, W
output lines, four control lines, and two lines for power and ground,
for a total of 8W + 6 pins per chip.

If s stages are merged beginning with stage i, n > i ≥ s-1, stages
i, i-1, i-2, ..., i-(s-1), then the resulting chip on which the
network is based needs

- (1 + 2 + 4 + ... + 2^s) W = (2^(s+1) - 1) W input lines, one from each
  of P, P + 2^i, P - 2^i, P + 2^(i-1), P - 2^(i-1), ..., P + 2^i + 2^(i-1),
  P - 2^i - 2^(i-1), ..., P + 2^i + 2^(i-1) + ... + 2^(i-(s-1)), and
  P - 2^i - 2^(i-1) - ... - 2^(i-(s-1)),

- (2^(s+1) - 1 + 1) W gates plus inverters for control signals,

- W output lines,

- 2s control lines, and

- 2 lines for power and ground.

This gives a total pin count of 2^(s+1) W + 2s + 2 pins per chip.
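
The input count can be checked by enumeration. The Python sketch below (hypothetical function names) lists every stage i+1 position that can feed position P when each of the s merged stages independently applies +2^j, -2^j, or straight, and then totals the pins. It assumes N is large enough that distinct offsets do not coincide modulo N.

```python
from itertools import product


def merged_sources(P, i, s, N):
    """Stage i+1 positions that can supply data to position P after s merged stages."""
    if s < 1 or i < s - 1:
        raise ValueError("need s >= 1 and i >= s - 1")
    sources = set()
    # Each merged stage i, i-1, ..., i-(s-1) contributes -2^j, 0, or +2^j.
    for moves in product((-1, 0, 1), repeat=s):
        offset = sum(m * 2 ** (i - t) for t, m in enumerate(moves))
        sources.add((P + offset) % N)
    return sources


def pins_per_chip(W, s):
    """Data inputs + data outputs + control lines + power and ground."""
    data_in = (2 ** (s + 1) - 1) * W
    return data_in + W + 2 * s + 2


if __name__ == "__main__":
    print(sorted(merged_sources(100, 5, 2, 1024)))  # seven positions for s = 2
    print(pins_per_chip(16, 2))                     # 8W + 6 with W = 16
```

Different control combinations can reach the same source (for example +2^i followed by -2^(i-1) equals +2^(i-1)), which is why 3^s settings yield only 2^(s+1) - 1 distinct inputs.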

Theorem IV.3 and the stage merging technique presented above can be
used in combination to meet design specifications and limitations.
Stage merging decreases the number of wires between stages, but
increases the number of inputs to each building module. Theorem IV.3
reduces the number of input wires needed by combining multiplexers with
common data input lines.


[Figure IV.21: One multiplexer behaves the same as two stages, stages i
and i-1, of the network for position P. The equivalent circuit receives
input from stage i+1 and outputs data to stage i-2. It selects data
from one of seven places.]


IV.9. Conclusions

The Cube, Shuffle-Exchange, and PM2I networks were implemented as
both  recirculating  and  multistage  networks.  For  each  design,  the  amount 
of  hardware  and  delay  were  calculated.  The  designs  provide  a  basis  for 
comparing  the  cost  and  data  transfer  time  delays  of  multistage  and 
recirculating  networks. 

The  Shuffle-No  Shuffle-Exchange  network  was  introduced.  This 
network  is  a  compromise  between  the  recirculating  and  multistage 
Shuffle-Exchange  networks,  since  it  can  perform  from  one  to  n-1  shuffles 
in  one  pass  through  the  network.  For  a  small  amount  of  additional 
hardware,  the  Shuffle-No  Shuffle-Exchange  network  can  perform  all  data 
transfers  that  the  multistage  Shuffle-Exchange  network  performs,  as  well 
as  all  the  data  transfers  which  can  be  performed  in  from  1  to  n-1  passes 
through  a  recirculating  Shuffle-Exchange  network. 

Pipelined  multistage  networks  were  examined  and  compared  to 
combinational  logic  multistage  networks  of  equal  delay  to  pass  data  and 
to  combinational  logic  multistage  networks  of  equal  cost.  Both  n-stage 
and  k-stage  pipelined  networks,  where  n/k  is  an  integer,  were 
considered.  Insight  has  been  gained  into  the  performance  and  cost  of 
pipelined  multistage  networks  and  the  trade-offs  between  pipelined 
multistage  networks  and  combinational  logic  multistage  networks. 

The  analyses  developed  in  this  chapter  can  be  used  to  evaluate  the 
effectiveness  of  a  given  interconnection  network  for  a  specified  task. 
Chapter  VII  presents  three  image  processing  tasks  and  uses  the  results 
of  Chapter  IV  to  compare  the  times  for  interprocessor  data  transfer  for 
several  different  networks. 


V.  PARTITIONING  INTERCONNECTION  NETWORKS 

V.1. Introduction

Multimicroprocessor  systems  with  as  many  as  2^^  processors  are 
being predicted for the future [PEA77, SUB77]. This implies that it is
important  for  an  interconnection  network  for  such  a  system  to  be 
partitionable  into  subnetworks  so  that  the  system  can  be  partitioned 
into  subsystems. 

Some computations may be more efficiently executed if the N PEs of
the SIMD machine can be partitioned into 2^(n-r) groups of 2^r PEs. Each
group may then behave like a smaller SIMD computer, and the system
resources may be more efficiently utilized. As an example, consider an
algorithm that is composed of k independent procedures, each of which
can be executed using 2^r or fewer PEs. Suppose that each procedure
requires at most j time periods to execute. The algorithm could be
performed one procedure at a time, using at most 2^r PEs at any time.
The computation would require at most j*k time periods, and 2^n - 2^r PEs
would be unused. If, on the other hand, the 2^n PEs were partitioned into
2^(n-r) groups of 2^r PEs, 2^(n-r) of the k procedures could be performed in
parallel (assuming that all groups use the same instruction stream or
that multiple control units are available). Now, the computer is more
fully utilized, and if k ≤ 2^(n-r), the algorithm can be executed in j time
periods.
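
The arithmetic of this example can be sketched in Python as follows. The function names are hypothetical, and the treatment of k > 2^(n-r), where procedures are run in successive batches of 2^(n-r), is an assumption that goes beyond the case stated in the text.

```python
import math


def sequential_time(k, j):
    """Upper bound on time periods when the k procedures run one at a time."""
    return j * k


def partitioned_time(k, j, n, r):
    """Upper bound when 2^(n-r) groups of 2^r PEs each run one procedure at a time."""
    groups = 2 ** (n - r)
    batches = math.ceil(k / groups)  # assumption: leftover procedures wait for a free group
    return batches * j


def idle_pes_sequential(n, r):
    """PEs left unused when only one procedure of at most 2^r PEs runs at a time."""
    return 2 ** n - 2 ** r


if __name__ == "__main__":
    n, r, k, j = 10, 6, 16, 5
    print(sequential_time(k, j), partitioned_time(k, j, n, r), idle_pes_sequential(n, r))
```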

Two classes of partitioning for an SIMD machine will be considered:
(1) use the system as one to 2^(n-r) SIMD machines, each having 2^r PEs,
each  performing  the  same  algorithm  (using  the  same  instruction  stream), 
but  each  on  a  different  data  set;  and  (2)  use  the  system  as  many  SIMD 
machines  of  varying  sizes,  each  size  a  power  of  two,  where  each  may  be 
performing  a  different  algorithm  on  a  different  data  set.  The  first 
class  will  be  referred  to  as  single  control  partitioning,  since  only  one 
control  mechanism  will  serve  all  of  the  PEs,  as  shown  in  Figure  V.1. 
The  second  class  will  be  referred  to  as  multiple  control  partitioning, 
since  a  separate  control  mechanism  will  be  required  for  each  instruction 
stream,  as  shown  in  Figure  V.2.  Note  that  multiple  control  partitioning 
includes  having,  for  example,  four  groups  following  one  instruction 
stream,  while  four  other  groups  each  follow  their  own  instruction 
streams.  Furthermore,  it  may  be  the  case  that  two  or  more  groups  may  be 
executing  different  SIMD  programs  on  the  same  logical  data  set,  although 
each  group  would  have  its  own  physical  copy.  Examples  of  machines  such 
as these that have been proposed are [NU77, SIE78a, SMSM78, BS78,
BFHP79]. The single control class is a special case of the multiple
control  class,  where  all  the  groups  use  one  control  unit.  Since  there 
is  only  one  instruction  stream,  a  conventional  SIMD  machine,  with  one 
control  unit,  can  provide  this  instruction  stream. 

When processors of the machine are partitioned into groups of size
2^r, each submachine must function as an independent entity. So, each
section of the partitioned interconnection network must function as a
complete network. That is, every section must have all of the
capabilities of a network of the given type built for a size 2^r system.

[Figure caption fragment: ... control units. Each of 2^(n-r) control
units commands 2^r PEs. Two control units can be linked to form an SIMD
machine of size 2^(r+1).]

This  chapter  considers  partitioning  interconnection  networks  using 
both  single  and  multiple  control  partitioning.  Section  2  considers 
multistage  networks.  Recirculating  networks  are  analyzed  in  section  3. 
Section  4  analyzes  pipelined  multistage  networks,  and  section  5 
considers  the  use  of  variable  size  partitions,  where  all  partitions  are 
of  size  a  power  of  two. 




This  section  shows  how  multistage  Shuffle-Exchange,  Cube,  and  PM2I 
networks  can  be  partitioned. 

Partitioning a multistage Shuffle-Exchange network (Figure III.3)
is discussed in [LAS76]. That discussion is extended here and is
related to the two types of partitioning defined above.

Theorem V.1: Let a system be partitioned into 2^(n-r) subsystems of 2^r PEs
each. For single control partitioning, if only a shuffle and no
exchange is allowed on each of the first n-r stages of the network, the
network can serve as a complete r stage Shuffle-Exchange network for
each of the 2^(n-r) groups of PEs of the partitioned machine.

Proof: Let the addresses of all PEs in each partition have the same n-r
most significant bits. Let loc_i(P) refer to the location of the data
originally in PE(P) after passing through i stages of the network
[LAS76]. Let the destination address, the address of the PE which
receives the data from P, be D = d_{n-1} ... d_1 d_0. After n-r shuffles with
no exchanges or broadcasts (the first n-r stages of the network are in
the straight state), the intermediate address of the data that was
initially in P is

\[ \mathrm{loc}_{n-r}(P) = p_{r-1} \ldots p_1 p_0\, p_{n-1} \ldots p_r. \]

The next stage is the first that allows an exchange. So,

\[ \mathrm{loc}_{n-r+1}(P) = p_{r-2} \ldots p_1 p_0\, p_{n-1} \ldots p_r\, d_{r-1}, \]

where $d_{r-1} = p_{r-1}$ or $\bar{p}_{r-1}$. The final address of the data, after r
Shuffle-Exchange steps, is

\[ \mathrm{loc}_{n}(P) = p_{n-1} \ldots p_r\, d_{r-1} \ldots d_1 d_0, \]

where $d_i = p_i$ or $\bar{p}_i$. All PEs with addresses that have the same n-r most
significant bits will be in the same partition of the machine. At stage
i, r > i ≥ 0, of the partitioned network, any interchange box compares
data from two lines whose addresses differ only in the i-th bit position.
If the n-r most significant address bits are ignored, these are the same
two lines which are the inputs to the interchange box in a Shuffle-
Exchange network for 2^r PEs. In this manner, a PE can communicate with
any other PE in its partition, just as if it had a Shuffle-Exchange
network of size 2^r. Also, no PE can disturb a PE not in its partition.
Figure V.3 illustrates this partitioning for N = 16 PEs partitioned into
four groups of four PEs each. □
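
The argument of the proof can be exercised with a small address-level simulation. The Python sketch below is a simplified model, not the network hardware: it tracks only where one datum can go, treats each stage as a shuffle (a one-bit left rotation of the n-bit address) followed by an optional exchange (complementing the least significant bit), and uses hypothetical function names.

```python
from itertools import product


def shuffle(addr, n):
    """Rotate the n-bit address left by one bit position."""
    msb = (addr >> (n - 1)) & 1
    return ((addr << 1) & ((1 << n) - 1)) | msb


def reachable(P, n, r):
    """Destinations of data from PE P when the first n-r stages are shuffle-only."""
    dests = set()
    for settings in product((0, 1), repeat=r):
        loc = P
        for _ in range(n - r):
            loc = shuffle(loc, n)         # straight state: shuffle, no exchange
        for exchange in settings:
            loc = shuffle(loc, n) ^ exchange
        dests.add(loc)
    return dests


if __name__ == "__main__":
    n, r = 4, 2                            # N = 16, four groups of four PEs
    for P in range(2 ** n):
        group = {q for q in range(2 ** n) if q >> r == P >> r}
        assert reachable(P, n, r) == group
    print("each PE reaches exactly the PEs sharing its n-r high-order bits")
```

For n = 4 and r = 2 this corresponds to the N = 16 case with four groups of four PEs.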


An  analogous  argument  may  be  made  to  show  that  the  PEs  may  be 
partitioned  based  on  any  common  set  of  n-r  bits  of  the  PE  addresses. 


Corollary V.1: Let a system be partitioned for single control
partitioning into 2^(n-r) groups of 2^r PEs. Let all PEs in a group have
some set of n-r address bits in common, and let these n-r bits be the
same set of n-r bits for all groups. Then the network can serve as a
complete r stage Shuffle-Exchange network for each of the 2^(n-r) groups of
PEs of the partitioned machine.

Proof: After any stage of the Shuffle-Exchange network, exactly one bit
of the intermediate address of the data can be changed. Stage n-1
affects only address bit n-1, so that after stage n-1,

\[ \mathrm{loc}_{1}(P) = p_{n-2} \ldots p_1 p_0\, d_{n-1}, \]

where $d_{n-1} = p_{n-1}$ or $\bar{p}_{n-1}$. Stages n-2 through 0 affect the intermediate
address location of the data in analogous manners; stage i affects
address bit p_i and changes it to d_i.

Let the n-r common PE address bits within all partitions be
$p_{i_1}, \ldots, p_{i_{n-r}}$, where $i_j \in \{0, \ldots, n-1\}$, $1 \le j \le n-r$. Since
there is a one to one correspondence between the stage number and the
address bits that stage changes, the network can be partitioned on any
set of address bits. If $p_{i_j}$ is one of the n-r common PE address bits,
then at stage $i_j$, the exchange function is disallowed. □


If the Shuffle-Exchange network employs individual box control,
both single control partitioning and multiple control partitioning can
be supported by the network. For example, consider a three stage
Shuffle-Exchange network with two function interchange boxes that is to
be partitioned into two groups of four PEs each. Let the two groups be G1
= (0, 1, 2, 3) and G2 = (4, 5, 6, 7). Consider the case where only G1
is using the network. Let c be a control matrix for the network of
Figure V.4, which represents the control signals the interchange boxes
receive. c(i,j) controls the i-th row interchange box at stage n-1-j.
An exchange occurs at that module if and only if c(i,j) = 1. (This
argument assumes two function boxes, but can be extended for four
function boxes.) To control the network for G1, the matrix is

\[
c = \begin{bmatrix}
0 & c(0,1) & c(0,2) \\
0 & - & c(1,2) \\
0 & c(2,1) & - \\
0 & - & -
\end{bmatrix}
\]

Only the control bits that relate to the submachine that is using the
network need be specified. All others may be ignored. To control the
network for G2 alone, the matrix is

\[
c = \begin{bmatrix}
0 & - & - \\
0 & c(1,1) & - \\
0 & - & c(2,2) \\
0 & c(3,1) & c(3,2)
\end{bmatrix}
\]

If G1 and G2 are to share the network under a single control
partitioning scheme, set c(0,1) = c(1,1), c(2,1) = c(3,1), c(0,2) =
c(2,2), and c(1,2) = c(3,2). For multiple control partitioning, G1 is
controlled by c(0,1), c(2,1), c(0,2), and c(1,2), while G2 is controlled
by c(1,1), c(3,1), c(2,2), and c(3,2). Thus, if c(i,0) = 0, 0 ≤ i < 4,
the two partitions can operate independently and asynchronously.

For four function interchange boxes, the same c(i,j) control G1 and
G2. For four function boxes, however, c(i,j) ∈ {0, 1, 2, 3}, where
c(i,j) = 0 means straight, c(i,j) = 1 means exchange, c(i,j) = 2 means
lower broadcast, and c(i,j) = 3 means upper broadcast.
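
For the two function case with N = 8, the assignment of control bits to groups can be reproduced by tracing addresses through the network. The Python sketch below is an assumed model with hypothetical names: each stage is a shuffle followed by a row of interchange boxes, the box in row i serves lines 2i and 2i+1, and column j of c belongs to stage n-1-j. It records which entries c(i,j) in the last r columns are touched by data from a given group.

```python
def shuffle(addr, n):
    """Rotate the n-bit address left by one bit position."""
    msb = (addr >> (n - 1)) & 1
    return ((addr << 1) & ((1 << n) - 1)) | msb


def control_bits_used(group, n, r):
    """Entries (i, j) of c, for columns j >= n-r, that handle data from `group`."""
    used = set()
    for P in group:
        frontier = {P}                    # all lines the datum from P may occupy
        for col in range(n):
            nxt = set()
            for loc in frontier:
                line = shuffle(loc, n)
                if col < n - r:
                    nxt.add(line)         # first n-r columns are forced to 0 (straight)
                else:
                    used.add((line >> 1, col))
                    nxt.update({line, line ^ 1})  # straight or exchange
            frontier = nxt
    return used


if __name__ == "__main__":
    g1 = control_bits_used({0, 1, 2, 3}, n=3, r=2)
    g2 = control_bits_used({4, 5, 6, 7}, n=3, r=2)
    print(sorted(g1))
    print(sorted(g2))
    print(g1.isdisjoint(g2))
```

In this model the two groups touch disjoint sets of control bits in the last two columns, which is the property multiple control partitioning relies on.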


Corollary V.2: An n stage Shuffle-Exchange (omega) network with
individual box control c(i,j) can be partitioned in a multiple control
environment, if all partitions are of size 2^r.


Proof: Divide the 2^n PEs into 2^(n-r) groups of 2^r, so that the n-r most
significant address bits, p_{n-1} ... p_r, of all PEs in group G are the same.
Since the first n-r stages allow only a shuffle and no exchange, the
entries of the first n-r columns of c are all 0, and

\[ \mathrm{loc}_{n-r}(P) = p_{r-1} \ldots p_1 p_0\, p_{n-1} \ldots p_r. \]

Stage r-1 is the first to allow an exchange to follow the shuffle. Let
c(A, n-r) be a control bit for stage r-1 where A = a_{n-2} a_{n-3} ... a_1 a_0,
since there are 2^(n-1) rows of c. Then c(A, n-r) controls PEs belonging
to G if

\[ A = a_{n-2} \ldots a_{n-r}\, p_{n-1} \ldots p_r. \]

That is, A is the address of an interchange box, 0 ≤ A < N/2. At stage
r-1, interchange box A affects PEs of group G if and only if the last
n-r address bits of A equal the first n-r address bits of any PE in G.

At stage r-(1+j), 1 ≤ j < r, c(A, n-r+j) controls PEs of G if

\[ A = a_{n-2} \ldots a_{n-r+j}\, p_{n-1} \ldots p_r\, a_{j-1} \ldots a_0. \]

For  single  and  multiple  control  partitioning,  if  the  group  of  PEs  is 
known,  the  control  signals  that  refer  to  that  group  are  given  by  the 
above  expression.  The  expression  can  be  extended  for  any  set  of  common 
n-r  address  bits  used  for  partitioning  and  for  four  function  interchange 
boxes.  A  broadcast  function  sends  one  datum  to  both  outputs  of  the 
interchange  box.  The  effects  are  the  same  as  sending  data  first  to  the 
output  for  the  "straight  across"  setting,  the  address  of  which  is  in  the 
partition,  and  then  to  the  output  for  the  "exchange"  setting,  which  is 
also  in  the  partition.  That  is,  for  a  broadcast,  both  outputs  are  part 


of  the  same  partition,  and  no  data  is  transferred  outside  its  partition. 




In [SIS78], it is shown that the indirect binary n-cube and the
generalized  cube  are  both  topologically  equivalent  to  the  multistage 
Shuffle-Exchange  network.  Since  these  three  networks  have  equivalent 
topologies,  the  partitioning  proofs  are  similar. 


Corollary V.3: The indirect binary n-cube network and the generalized
cube network with individual box control can be partitioned in a
multiple control environment if all partitions are of size 2^r.

Proof;  Corollary  V.3  follows  from  the  topological  equivalence  of  the 
indirect  binary  n-cube,  the  generalized  cube,  and  the  multistage 
Shuffle-Exchange  network,  and  from  Theorem  V.1.  The  only  difference 
between  these  two  networks  and  the  omega  network  discussed  in  Corollary 
V.2  is  the  position  of  the  control  bits  in  the  control  matrix.  For 
example,  for  N=8,  consider  G1  as  defined  above.  For  the  generalized 
cube,  the  control  matrix  would  be 


as shown in Figure V.5. For the indirect binary n-cube, if c'(i,j) =
c(i,n-1-j), then:

    c' =  | c'(0,0)  c'(0,1) |
          | c'(1,0)  c'(1,1) |

as shown in Figure V.6. In both cases, the stage which affects address
bit p_2 is set to no exchange. So, data cannot be exchanged between a PE
in G1 with address 0XX and a PE in G2 with address 1XX, where, as in the
PE address mask notation (Chapter II), X is "don't care," i.e., either 0
or 1. □

Figure V.5: If PEs 0, 1, 2, and 3 use the network, then a generalized
cube network uses the controls above to affect data passage through the
network.

Figure V.6: An indirect binary n-cube network controls data from PEs 0,
1, 2, and 3 using the controls above.



In the STARAN flip network [BA76] (Figure III.4), stage i can
perform the Cube-type interconnection function Cube_i. The
Shuffle-Exchange and flip networks are topologically equivalent [SIS78],
and thus the partitioning proofs are similar.


Theorem V.2: The STARAN flip network with flip control may be
partitioned  under  single  control  partitioning,  but  not  under  multiple 
control  partitioning. 

Proof: The control vector F of the flip network allows the partitioning
of the network. In order to partition the machine into 2^{n-r} groups of
2^r PEs, where the addresses in each group have the same n-r most
significant bits, the control vector used is

    F = ( 0 ... 0 f_{r-1} ... f_1 f_0 ),

where the first n-r entries are 0. For an arbitrary PE address, P, the
first r stages of the network produce

    loc_r(P) = p_{n-1} ... p_r d_{r-1} ... d_0,

where d_i = p_i or the complement of p_i. If the last n-r stages allow
no exchanges, the address of the final destination of the data is

    loc_n(P) = p_{n-1} ... p_r d_{r-1} ... d_0.

Thus, all PEs with the same n-r most significant bits will be in the
same partition of the reconfigurable machine. Again, this procedure can
be generalized to group PEs that have any set of n-r bits in common. For
example, if the n-r least significant address bits are the same for
all PEs in a group, then

    F = ( f_{n-1} ... f_{n-r} 0 ... 0 ).

Since each stage of the flip network receives only one control bit,
multiple control partitioning is not possible. □
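
The argument of Theorem V.2 can be checked by brute force for small n. The sketch below is illustrative only: it assumes that, under flip control, stage i applies Cube_i to every PE exactly when f_i = 1, so that data from PE P arrives at P XOR F. The function names are hypothetical and do not come from the report.

```python
def flip_destination(p: int, f: int) -> int:
    """Destination of the data from PE p under flip control vector f.

    Assumption: stage i complements address bit i for all PEs iff bit i of f is 1.
    """
    return p ^ f


def stays_in_partition(n: int, r: int) -> bool:
    """True if every F with its n-r high-order entries 0 keeps data inside
    the group defined by the n-r most significant address bits."""
    for f in range(1 << r):            # only f_{r-1} ... f_0 may be 1
        for p in range(1 << n):
            if (p >> r) != (flip_destination(p, f) >> r):
                return False
    return True
```

Because one vector F is shared by all groups, every group performs the same permutation, which is exactly the single control case.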


Corollary V.4: The generalized cube network with individual stage
control  can  be  partitioned  under  single  control  partitioning,  but  not 
under  multiple  control  partitioning. 

Proof: Since the generalized cube network and the STARAN flip network
are topologically equivalent [SIS78], Corollary V.4 follows from Theorem
V.2. □

The STARAN also allows a set of "shift permutations." The shift
controls for the STARAN provide i+1 control signals at stage i to move
data at input x to output x + 2^m modulo 2^p, 0 ≤ m ≤ p ≤ n. This set of
controls is illustrated in Figure III.4b for N=8. In general, the
controls at stage i, ci0, ci1, ..., and cii, affect groups of PEs as
follows. Recall from Chapter II that a set of PE addresses can be
expressed as a string of 0's, 1's, and X's, where X is a don't care,
i.e., either 0 or 1. Also, X^i is a string of i X's in the specification
of a PE address.

    ci0:  X^{n-i} 0^i
    ci1:  X^{n-i} 0^{i-1} 1
    ci2:  X^{n-i} 0^{i-2} 1 X
    ...
    cii:  X^{n-i} 1 X^{i-1}
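
The address patterns above can be generated mechanically. The following is a minimal sketch, assuming the general form cij: X^{n-i} 0^{i-j} 1 X^{j-1} for j ≥ 1 (which agrees with the listed cases ci1, ci2, and cii); the helper names are hypothetical.

```python
def shift_control_pattern(n: int, i: int, j: int) -> str:
    """Address pattern (most significant bit first) of the PEs governed by
    shift control c(i, j) at stage i, written with 0, 1 and X."""
    if not 0 <= j <= i < n:
        raise ValueError("need 0 <= j <= i < n")
    if j == 0:
        return "X" * (n - i) + "0" * i
    return "X" * (n - i) + "0" * (i - j) + "1" + "X" * (j - 1)


def matching_pes(pattern: str) -> list[int]:
    """All PE addresses that match a 0/1/X pattern."""
    n = len(pattern)
    result = []
    for p in range(1 << n):
        bits = format(p, f"0{n}b")
        if all(c in ("X", b) for c, b in zip(pattern, bits)):
            result.append(p)
    return result
```

For N = 8, stage 2 yields the patterns X00, X01, and X1X, which together cover every PE exactly once.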




Referring to Figure III.4b, c00 = 0A, c10 = 1A, c11 = 1B, c20 = 2A, c21
= 2B, and c22 = 2C. Under certain conditions, these controls can be
effectively used for single control partitioning. However, they allow
independent use for no groups of PEs if 2^{n-r} groups of 2^r PEs,
1 < r < n, must use the network at one time under multiple control
partitioning.

In general, the STARAN flip network with shift control cannot
always be partitioned into 2^{n-r} groups of 2^r PEs using single control
partitioning. Consider the case for N = 8 and r = 2, where the two groups
of PEs are G0 = {X0X} and G1 = {X1X}. Physical stage 0 is logical
stage 0. Since the middle bit position, p_1, determines the partitioning,
physical stage 1 must be set to straight in order to isolate the
partitions. Physical stage 2 is logical stage 1. In a STARAN network
for four PEs, logical c10 affects all PEs in a group whose addresses end
in 0, i.e., PEs {XX0}, and logical c11 affects all PEs in a group whose
addresses end in 1, i.e., PEs {XX1}. But, physical c22 affects all PEs
{X1X}. It affects some that should respond to logical c10, i.e., PEs
{X10}, and some that should respond to logical c11, i.e., PEs {X11}.
So, single control partitioning is not possible for this case.

Theorem V.3: Let a system be partitioned into 2^{n-r} groups of 2^r PEs. If
the n-r PE address bits which define a partition are the n-r most
significant address bits or the n-r least significant address bits, then
the STARAN flip network with shift controls can be partitioned under
single control partitioning.




Proof: If the n-r most significant address bits determine the
partitioning, then physical stage i is logical stage i, 0 ≤ i < r.
Stages r through n-1 are set to no exchange (straight through the
stages). All control signals affect the same set of PEs of a partition.
For example, at stage i, 0 ≤ i < r, ci0 affects PEs with addresses
X^{n-i} 0^i. Since n-i > n-r, all PEs with addresses whose least significant
bits are 0^i in all partitions are affected by ci0. ci1 affects PEs
X^{n-i} 0^{i-1} 1. Since n-i > n-r, all PEs with addresses whose least
significant bits are 0^{i-1} 1 in all partitions are affected by ci1. So,
single control partitioning is possible.

If the partitioning is based on the n-r least significant PE
address bits, then all PEs in a partition have the same n-r least
significant bits. These n-r bits remain the same at each intermediate
address location of the data as it passes through the network. So,
stages 0 through n-r-1 are set to no exchange for all PEs. Physical
stage n-r is logical stage 0 for the partitions. One logical control,
c00, is needed, so all n-r+1 controls of the stage are set to c00.

Stage k = n-r+1 is logical stage 1. Note that n-k = n-(n-r+1) =
r-1. Logical control c10 affects PEs X^{r-1} 0 of all partitions, that is,
PEs X^{r-1} 0 X^{n-r}. The physical controls that become c10 and the PEs
they affect are

    ck0:      X^{r-1} 0 0^{n-r}
    ck1:      X^{r-1} 0 0^{n-r-1} 1
    ck2:      X^{r-1} 0 0^{n-r-2} 1 X
    ...
    ck(k-1):  X^{r-1} 0 1 X^{n-r-1}.

Logical c11 affects PEs X^{r-1} 1 X^{n-r}. So, logical c11 is physical ckk.

For i > 1, logical stage i is physical stage n-r+i. Let k = n-r+i.
Then n-k = n - (n-r+i) = r-i. Logical ci0 affects PEs X^{r-i} 0^i of all
partitions, i.e., PEs X^{r-i} 0^i X^{n-r}. The physical controls that become
ci0 and the PEs they affect are

    ck0:      X^{r-i} 0^i 0^{n-r}
    ck1:      X^{r-i} 0^i 0^{n-r-1} 1
    ...
    ck(k-i):  X^{r-i} 0^i 1 X^{n-r-1}.

Logical controls cij, 0 < j < i, are similar. Logical cii is physical
control signal ckk.

For example, let N = 8 (as in Figure III.4b) and r = 2. The system
is divided into 2^{3-2} = 2 groups of 2^2 = 4 PEs each. Physical stage 0 is
set to straight. Logical stage 0 is physical stage 1, so logical
control c00 is set to physical controls c10 and c11. Logical stage 1 is
physical stage 2. Logical c10 affects the four PEs {X0X}. So, physical
controls c20 (which affects PEs {X00}) and c21 (which affects PEs
{X01}) become logical control c10. Logical c11 affects PEs {X1X}, and so
physical c22 becomes logical c11. □
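
The physical-to-logical control assignment used in this proof can be tabulated for small systems. The sketch below is an illustration under stated assumptions, not the report's procedure: it takes the control governing an address at stage i to be index j, where j = 0 if the low i address bits are all 0 and otherwise j-1 is the position of the highest 1 among them, and it treats the partition as defined by the n-r least significant bits. All names are hypothetical.

```python
def control_index(addr: int, i: int) -> int:
    """Index j of shift control c(i, j) governing address addr at stage i."""
    low = addr & ((1 << i) - 1)     # the low i address bits
    return low.bit_length()         # 0 if all zero, else highest set bit + 1


def physical_to_logical(n: int, r: int) -> dict:
    """Map each physical control (k, m) to the logical control (i, j) it must
    carry when groups share their n-r least significant bits.

    Raises ValueError if one physical control would need two logical values.
    """
    shift = n - r
    mapping = {}
    for i in range(r):                      # logical stage i
        k = shift + i                       # corresponding physical stage
        for p in range(1 << n):
            phys = (k, control_index(p, k))
            logi = (i, control_index(p >> shift, i))
            if mapping.setdefault(phys, logi) != logi:
                raise ValueError(f"physical control {phys} is ambiguous")
    return mapping
```

For n = 3 and r = 2 this reproduces the example: physical c10 and c11 carry logical c00, physical c20 and c21 carry logical c10, and physical c22 carries logical c11.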

Theorem V.4: The STARAN flip network with shift controls cannot be
partitioned under multiple control partitioning, if 2^{n-r} groups of 2^r
PEs, 1 < r < n, use the network at one time.

Proof: For multiple control partitioning, if all groups are of size 2
and so require only one stage of the network, then 2^{n-1} groups of two
PEs can independently use the network. This is because if two PEs whose
addresses differ in the i-th bit position exchange data, then Cube_i is
executed. If they do not exchange data, then they need not use the
network.

Consider multiple control partitioning for 2^{n-r} groups of 2^r PEs,
1 < r < n.

Case 1: 2 groups of 2^{n-1} PEs (r = n-1).

    Here, n-1 stages of the STARAN network are needed so that each
    group may have a complete network to use. Since stage 0 has
    only one control, the n-1 stages must be stages 1 through n-1.
    Stage 1 supplies one control signal for each group, and so is
    logical stage 0 for each partition. Stage 2, however, has only
    three controls, not the necessary four (two for each
    partition). So, 2 groups of 2^{n-1} PEs cannot use the network
    independently.

Case 2: 2^{n-r} groups of 2^r PEs, 1 < r < n-1.

    Logical stage 0 must have 2^{n-r} independent controls, one for
    each group of 2^r PEs. Suppose physical stage i is logical
    stage 0 for each group of the partition. Stage i must have
    i+1 ≥ 2^{n-r} independent controls, each control affecting 2^r PEs.
    But, cii affects PEs X^{n-i} 1 X^{i-1}, i.e., 2^{n-1} PEs. Since
    2^{n-1} > 2^r, 1 < r < n-1, stage i does not provide enough
    controls so that each group of PEs may use the network
    independent of any other group. □



Corollary V.5: The generalized cube with i+1 control signals at stage i
can be partitioned under single control partitioning, if the PE address
bits which define the partition are the n-r most significant bits or the
n-r least significant bits. Also, the network cannot be partitioned
under multiple control partitioning.

Proof: Since the generalized cube is topologically equivalent to the
STARAN flip network, this corollary follows from Theorem V.3 and Theorem
V.4.

The data manipulator [FE74] is an n stage network where stage i can
perform PM2_{+i} (D), PM2_{-i} (U), and no change (H), as shown in Figure
III.7. Each control is further divided into 2 sets as shown. For stage
i, U_1, D_1, and H_1 control those PEs whose i-th bit is 0, and U_2, D_2,
and H_2 control those PEs whose i-th bit is 1.

This multistage network may easily be divided into 2^{n-1} groups of 2
adjacent PEs (N' = 2). Let the two PEs be PE(i) and PE(i+1 modulo N'),
for i even. Since N' = 2, the network need consist only of PM2_{+0} (note
that for N' = 2, PM2_{+0} is the same as PM2_{-0}). This can be obtained
using D_1 U_2 on stage 0 and H_1 H_2 on all other stages. Similarly, PE(i)
and PE(i+1 modulo N'), for i odd, can exchange data if U_1 D_2 is used on
stage 0 and H_1 H_2 is used on all other stages. This is true for multiple
control partitioning since PEs not exchanging data can ignore the
network.



As  an  additional  point,  consider  pairing  two  arbitrary  PEs. 

Theorem V.5: For the data manipulator network, for the control functions
given as in Figure III.7, if only two PEs use the network at one time
(single control partitioning), then any two PEs can exchange data in one
pass through the network.

Proof: For N PEs, let PE A and PE B exchange data, where A = a_{n-1}...a_0,
B = b_{n-1}...b_0, and A ≠ B. In order for PE A and PE B to exchange data,
the destination address of the data from A is a'_{n-1}...a'_0 = b_{n-1}...b_0,
and the destination address of the data from B is b'_{n-1}...b'_0 =
a_{n-1}...a_0. An algorithm to perform such an input address to output
address translation is as follows.

    For stage i, 0 ≤ i < n:

        If a_i = b_i, then no change.

        If a_i = 0 and b_i = 1, then apply PM2_{+i} to the
        data from A, and apply PM2_{-i} to the data from
        B (D_1 U_2). Thus,

            a'_i = a_i + 1 = 1 = b_i, and
            b'_i = b_i - 1 = 0 = a_i.

        If a_i = 1 and b_i = 0, then apply PM2_{-i} to the
        data from A, and apply PM2_{+i} to the data from
        B (D_1 U_2). Thus,

            a'_i = a_i - 1 = 0 = b_i, and
            b'_i = b_i + 1 = 1 = a_i.

For this algorithm, after the i-th stage,

    loc_i(A) = b_{n-1}...b_{n-i} a_{n-i-1}...a_0, and
    loc_i(B) = a_{n-1}...a_{n-i} b_{n-i-1}...b_0.

Finally, after n stages,

    loc_n(A) = b_{n-1}...b_1 b_0, and
    loc_n(B) = a_{n-1}...a_1 a_0.

Table V.1 shows the control settings for stage i that result from
the execution of the algorithm above. □
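
The exchange algorithm in the proof of Theorem V.5 is short enough to trace in code. The following is a sketch only; it assumes the stages are traversed from n-1 down to 0 (consistent with the loc_i expressions above) and models PM2_{+i} and PM2_{-i} as adding or subtracting 2^i modulo N. The function name and the returned setting labels are hypothetical.

```python
def exchange_route(a: int, b: int, n: int):
    """Trace the data from PE a and PE b through an n-stage data manipulator.

    Returns the final locations of both data items and the per-stage settings.
    """
    N = 1 << n
    loc_a, loc_b = a, b
    settings = []
    for i in reversed(range(n)):            # assumed order: stage n-1 first
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        if bit_a == bit_b:
            settings.append((i, "H", "H"))  # no change for either item
        elif bit_a == 0:                    # a_i = 0, b_i = 1
            loc_a = (loc_a + (1 << i)) % N  # PM2_{+i} on the data from A
            loc_b = (loc_b - (1 << i)) % N  # PM2_{-i} on the data from B
            settings.append((i, "PM2+", "PM2-"))
        else:                               # a_i = 1, b_i = 0
            loc_a = (loc_a - (1 << i)) % N  # PM2_{-i} on the data from A
            loc_b = (loc_b + (1 << i)) % N  # PM2_{+i} on the data from B
            settings.append((i, "PM2-", "PM2+"))
    return loc_a, loc_b, settings
```

No carry or borrow ever propagates, because 2^i is added only where bit i is 0 and subtracted only where bit i is 1, so each stage changes exactly one address bit.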


This algorithm may be adapted for both the n stage Shuffle-Exchange
and the flip networks. Let PEs P and Q exchange data. For the
Shuffle-Exchange, if bit i of the two PE addresses is the same, no
exchange occurs at stage i, and c(A, n-1-i) = 0, 0 ≤ A < 2^{n-1}. If the
two bits differ, then an exchange occurs, and c(A, n-1-i) = 1,
0 ≤ A < 2^{n-1}. For the flip network, at stage i, if the i-th bits of P
and Q are the same, then f(i) = 0, and no exchange occurs. If the two
bits differ, then f(i) = 1, and an exchange occurs.

If a restriction is made, the data manipulator network can be
partitioned such that each group has a complete data manipulator network
at its disposal. For example, let N = 8, and let the machine be
partitioned into 2^1 groups of 2^2 PEs. Let one group contain PEs 0, 2,
4, and 6, while the other group contains PEs 1, 3, 5, and 7. The
control structure of this network is not general enough to allow
multiple control partitioning, and so only single control partitioning
will be considered. If only the straight through path (H) is
allowed at stage 0, then each group of 4 PEs has a log2 4 stage data
manipulator network available to it. The physical stage 1 now
corresponds to a logical stage 0 for each partitioned group, since each
PE address differs from the addresses of its neighbors by two modulo 8.
PM2_{+1} moves data from position 0 to position 2, from 2 to 4, from 4 to
6, and from 6 to 0. Similarly, the physical stage 2 corresponds to a
logical stage 1 for the group. Figure V.7 illustrates this process.

Table V.1: The data manipulator network can exchange two data items if
the following controls are applied at stage i, 0 ≤ i < n.


Theorem V.6: Let the data manipulator be partitioned into 2^{n-r} groups of
2^r PEs so that all PEs in a partition have the same n-r low order bits.
Under this restriction, the data manipulator can be partitioned under
single control partitioning.

Proof: Each group is composed such that consecutive PEs within a group
have PE addresses that differ by 2^{n-r}. Then, stage n-r behaves as a
stage 0 for the group, stage (n-r+1) behaves as a stage 1, and so on,
and lastly, stage n-1 behaves as a stage (r-1) for the group of 2^r PEs.
Stage i, 0 ≤ i < n-r, must be set in the "H" state so that address bits
p_{n-r-1} through p_0 are the same for both the input address and the
output address of the data passing through the network. For all j,

    PM2_{±j}(p_{n-1}...p_0) = p'_{n-1}...p'_j p_{j-1}...p_1 p_0.

That is, adding or subtracting one in bit position j cannot affect bit
position i, 0 ≤ i < j, of a PE address. So stage j, n-r ≤ j < n, cannot
affect address bit i, 0 ≤ i < n-r, and the partitions are independent. □
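
The key step of this proof, that PM2_{±j} leaves every address bit below position j unchanged, can be confirmed exhaustively for small n. This is a minimal sketch assuming PM2_{±j}(P) = (P ± 2^j) modulo 2^n; the function names are hypothetical.

```python
def pm2(p: int, j: int, sign: int, n: int) -> int:
    """PM2_{+j} (sign = +1) or PM2_{-j} (sign = -1) applied to address p."""
    return (p + sign * (1 << j)) % (1 << n)


def low_bits_preserved(n: int, r: int) -> bool:
    """True if stages n-r .. n-1 never alter the n-r least significant bits."""
    mask = (1 << (n - r)) - 1
    return all(
        (pm2(p, j, s, n) & mask) == (p & mask)
        for p in range(1 << n)
        for j in range(n - r, n)
        for s in (1, -1)
    )
```

For N = 8 and r = 2, pm2(6, 1, +1, 3) gives 0, matching the move from position 6 to position 0 described for the even-numbered group.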

Corollary V.6: Let the N PEs be partitioned as in Theorem V.6. Then the
augmented data manipulator (ADM) can be used under single or multiple
control partitioning, if all partitions are the same size.

Proof: Follows from Theorem V.6 and the definition of the ADM network. □

Figure V.7: For N = 8, a multistage PM2I network can be partitioned
into two groups of four PEs. Group A is PEs 0, 2, 4, and 6. Group B is
PEs 1, 3, 5, and 7.

There are means of allowing 2^r PEs with consecutive addresses to be
grouped together for the data manipulator network and still have a
complete network available. However, problems arise due to missing
connections, such as the 0 to 2^r - 1 connection for PM2_{-0}.

Let the 2^n PEs be divided into 2^{n-r} groups of 2^r consecutive PEs.
Each group no longer has a complete r-stage data manipulator network to
use. For example, let N=32 and r=3. Then the partitioning is shown in
Table V.2, where the logical number of a PE is its residue class, modulo
2^r, and PTi means partition i.

The  partitions  of  the  network  do  not  have  all  2r  logical 
interconnection  functions  available  for  use.  For  example,  6+2=0 
modulo  8,  but  there  is  no  physical  connection  between  PEs  6  and  0,  since 
the  network  was  built  for  N=32,  not  N=8.  These  missing  PM2I  connections 
will  be  referred  to  as  missing  end-around  connections  since  they  connect 
PEs  at  opposite  ends  of  a  group. 

Theorem V.7: The augmented data manipulator (ADM) network can be used
for single control partitioning, if the PEs are partitioned into 2^{n-r}
groups of 2^r consecutive PEs.

Proof: Consider the case where the virtual PM2_{+i} interconnection
function, for some 0 ≤ i < r, is a missing end-around connection for the
data item I that is in cell y just prior to stage i (the case for PM2_{-i}
is similar). Let cell y be the b-th cell in the p-th partition of size
2^r, that is, y = p2^r + b.




Table V.2: Partitioning based on consecutive PEs using the augmented
data manipulator network.

    Logical    Physical number
    number     PT0   PT1   PT2   PT3    Notes
    0           0     8    16    24     need PM2_{-j}, j = 0,1,2
    1           1     9    17    25     need PM2_{-j}, j = 1,2
    2           2    10    18    26     need PM2_{-j}, j = 2
    3           3    11    19    27
    4           4    12    20    28
    5           5    13    21    29     need PM2_{+j}, j = 2
    6           6    14    22    30     need PM2_{+j}, j = 1,2
    7           7    15    23    31     need PM2_{+j}, j = 0,1,2



If PM2_{+i} is a missing end-around connection, then b ≥ 2^r - 2^i. PM2_{+i}
sends this data to cell

    y1 = y + 2^i = (p2^r + b + 2^i) modulo N
       = (((p + 1) mod 2^{n-r})2^r + b') modulo N,

where b' = b + 2^i - 2^r, 0 ≤ b' < 2^i.

Thus, the data is in cell b' of partition p+1 modulo 2^{n-r} instead of
cell b+2^i modulo 2^r of partition p. Figure V.8 illustrates the errant
data movement for N = 8 and 2^r = 4. When the PM2_{+1} function is applied
to PE(2) of partition 0, data moves to PE(4), which is logical PE(0) of
partition 1. The correct data movement would be to PE(0) of partition 0.
Since 2^r - 2^i ≤ b < 2^r and i < r,

    (b + 2^i) modulo 2^r = (b + 2^i - 2^r) modulo N.

Thus,

    (y1 - 2^r) modulo N = (p2^r + (b + 2^i) modulo 2^r) modulo N.

If stages i-1 through 0 are "H" for data item I, then if the network is
set so that stage r is "U" for item I, the missing end-around connection
will be compensated for before it occurs. Figure V.9 shows the
correction for the example of Figure V.8. The flow of data is from
physical PE(2) (logical PE(2) of partition 0) to PE(6) = PE(PM2_{-2}(2))
to PE(0) = PE(PM2_{+1}(6)).

If PM2_{-i} is the missing end-around connection, then an argument
similar to the one above shows that setting stage r to PM2_{+r} for the
appropriate cells corrects any false data movement.
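
The compensation for a missing PM2_{+i} end-around connection can be checked numerically. The sketch below is illustrative, assuming stage r is traversed before stage i and that "U" at stage r means PM2_{-r}; both function names are hypothetical.

```python
def uncompensated_destination(p: int, b: int, i: int, r: int, n: int) -> int:
    """Physical cell reached when PM2_{+i} alone is applied to cell p*2^r + b."""
    return (p * (1 << r) + b + (1 << i)) % (1 << n)


def compensated_destination(p: int, b: int, i: int, r: int, n: int) -> int:
    """Physical cell reached when stage r applies PM2_{-r} and stage i then
    applies PM2_{+i}, with every other stage set to "H"."""
    N = 1 << n
    y = p * (1 << r) + b
    y = (y - (1 << r)) % N      # stage r: "U", i.e. PM2_{-r}
    y = (y + (1 << i)) % N      # stage i: PM2_{+i}
    return y
```

With N = 8, 2^r = 4, p = 0, b = 2, and i = 1, the uncompensated move lands in cell 4 (partition 1), while the compensated move passes through cell 6 and ends in cell 0 of partition 0, as in the example of Figures V.8 and V.9.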

Consider the case where after passing through the rest of the
network, data item I is moved, after stage i, from cell y1 to y1 + d,
-2^i < d < 2^i, i.e., stages i-1 through 0 are not all "H" for item I. As
opposed to the d=0 case above, when -2^i < d < 2^r - 2^i - b no
compensation is needed, i.e.,

    (y1 + d) modulo N = (p2^r + b + 2^i + d) modulo N
                      = (p2^r + (b + 2^i + d) modulo 2^r) modulo N,

    2^r - 2^i ≤ b < 2^r, i < r.

Figure V.10 illustrates the above case. In Figure V.8, the
interconnection function PM2_{+1} moved data from PE(2) to PE(4). If
interconnection function PM2_{-0} is applied to PE(4), then the data moves
to PE(3) of partition 0. That is, the net distance the data from PE(2)
moved was (+2 - 1) = +1, i.e., from PE(2) to PE(3) of partition 0.

Figure V.9: Applying PM2_{-2} to PE(2) before applying PM2_{+1} corrects for
the missing end-around PM2_{+1} connection for PE(2).


When 2^r - 2^i - b ≤ d < 2^i, the only compensation that is needed is
a single -2^r, as in the d=0 case above. Here

    (y1 + d - 2^r) modulo N = (p2^r + b + 2^i + d - 2^r) modulo N.

Since 2^r - 2^i - b ≤ d < 2^i,

    0 ≤ b + 2^i + d - 2^r < 2^{i+1} - 2^r + b.

Furthermore, since 2^r - 2^i ≤ b < 2^r,

    0 ≤ b + 2^i + d - 2^r < 2^{i+1},

where i < r. Thus, (b + 2^i + d - 2^r) modulo N = (b + 2^i + d) modulo 2^r. As
an example, suppose that it is desired to move data from PE(2) to
PE((2 + 3) modulo 4 = 1) of partition 0 by applying the virtual
interconnection functions PM2_{+1} and PM2_{+0}. Setting stage r to PM2_{-2} for
PE(2) causes the desired movement of data. Figure V.11 illustrates the
correction.

The above techniques of grouping consecutive PEs are limited to
single control partitioning, since data will cross partition boundaries. □

In summary, for multiple control partitioning, the data manipulator
network is adequate to divide the PEs into 2^{n-1} groups of two PEs where
a group consists of PE(i) and PE(i+1 modulo N), 0 ≤ i < N. With single
control partitioning, certain sets of 2^r PEs may have an r stage network
available, but the grouping of 2^r PEs is more restricted than with the
STARAN flip and Shuffle-Exchange networks. Like the flip network, the
control structure here is not general enough to allow multiple control
partitioning.

Figure V.11: PM2_{-2} is applied to PE(2) before applying PM2_{+1} and
PM2_{+0}, so the data is sent to PE((2 + 2 + 1) modulo 4) = PE(1) of
partition 0.

The  augmented  data  manipulator,  with  its  flexible  control 
structure,  can  be  partitioned  under  single  or  multiple  control 
partitioning,  if  the  addresses  of  the  PEs  in  a  partition  have  the  same 
n-r  least  significant  bits. 

The  transition  from  a  combinational  logic  multistage  network  to  a 
pipelined  multistage  network  does  not  change  the  overall  interconnection 
or  control  structure  of  the  network.  Therefore,  the  pipelined  network 
must  partition  in  the  same  manner  as  the  combinational  logic  multistage 
network  upon  which  it  is  based. 



V.3. Partitioning Recirculating Networks

Section  V.3  considers  partitioning  the  recirculating  Illiac,  Cube, 
and  PM2I  networks.  Both  single  and  multiple  control  partitioning  are 
considered. 

To control the Illiac network, let us assume that the control unit
broadcasts  to  all  active  PEs  one  of  the  four  functions  to  execute  as 
they  would  any  other  instruction.  Then,  some  partitions  are  easily 
implemented  with  this  network. 

Theorem V.8: An Illiac network may be partitioned into 2^{n-1} groups of 2
PEs, where the two are PE(i) and PE(i+1) or PE(i) and PE(i+m).

Proof: Let the group be PE(i) and PE(i+1), for either i even or i odd.
PE(i) needs only Illiac_{+1} and PE(i+1) needs only Illiac_{-1} in order to
move data about the submachine. Let the group be PE(i) and PE(i+m).
Then, PE(i) uses Illiac_{+m} and PE(i+m) uses Illiac_{-m} to move data about
the network.

The Illiac network cannot be partitioned into groups of 4 without
some restrictions. If PE(i), PE(i+1), PE(i+m), and PE(i+1+m) are
grouped together, Figure V.12 shows their configuration for N = 16.
Note that if a complete Illiac network were available for a system of
four PEs, as in Figure V.13, then PE(i) and PE(i+1+m) would be
connected, and PE(i+1) and PE(i+m) would be connected. In Figure V.12,
they are not. The partitioned network cannot pass data in the same way
as can a complete Illiac network for N = 4. If, however, by
appropriately controlling the active status of the PEs, we allow PE(i)
to first pass data to PE(i+1), and then let PE(i+1) pass this data to
PE(i+1+m), the appropriate connection between PE(i) and PE(i+1+m) can be
realized. The transfer of data now requires that the interconnection
network be used twice, and thus the time to pass the data is doubled.
Also, steps must be taken to ensure that the data presented to PE(i+1)
is not destroyed while passing data from PE(i) to PE(i+1+m) through
PE(i+1).
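
The missing diagonal connection can be seen by listing the Illiac neighbors of a PE. This sketch assumes the four Illiac functions are +1, -1, +m, and -m modulo N with m equal to the square root of N, which is the usual definition but is not restated in this section; the function name is hypothetical.

```python
import math


def illiac_neighbors(p: int, N: int) -> set[int]:
    """PEs reachable from PE p in one pass of an N-PE Illiac network.

    Assumption: N is a perfect square and m = sqrt(N).
    """
    m = math.isqrt(N)
    return {(p + 1) % N, (p - 1) % N, (p + m) % N, (p - m) % N}
```

For N = 16 and the group {0, 1, 4, 5}, PE 5 is not a neighbor of PE 0, but it is a neighbor of PE 1, so the two-pass route 0 to 1 to 5 is available. In a complete four-PE Illiac network, by contrast, PE 0 and PE 3 are directly connected.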

In general, for groups of PEs larger than two, this same
observation may be made. For a partitioned machine of size 2^r, the PEs
at the side edges of the array (i.e., PE(0), PE(2^{r/2} - 1), PE(2^{r/2}),
PE(2(2^{r/2}) - 1), ..., and PE(2^r - 1)), and the top and bottom edges
will not be connected as they would with a complete Illiac network for N
= 2^r.

The  Illiac  network  may  easily  exchange  data  between  two  adjacent 
PEs.  For  larger  groups  of  PEs,  the  data  may  need  to  be  passed  among  PEs 
to  simulate  the  edge  connections  of  the  Illiac.  This  results  in 
additional  considerations  for  the  control  of  the  network,  additional 
time  for  the  data  to  reach  their  final  destinations,  and  additional 
pitfalls  for  the  programmer.  The  problem  occurs  in  both  multiple  and 
single  control  partitioning,  since  the  edge  connections  for  a  group  can 
be  simulated  using  only  the  PEs  within  that  group. 

In Chapter IV, a circuit was presented for a recirculating Cube
network.  At  each  pass  through  the  network,  any  of  the  n  Cube 
interconnection  functions  may  be  selected  to  route  data. 




Theorem V.9: A recirculating Cube network can be partitioned into 2^{n-r}
groups of 2^r PEs under single or multiple control partitioning, if all
partitions are of size 2^r, 0 ≤ r ≤ n.

Proof: A recirculating Cube may be partitioned in a manner analogous to
that of a multistage Cube. Let the addresses of all PEs in each
partition have the same n-r most significant bits. In order to
partition the recirculating Cube network, the n-r interconnection
functions Cube_r, Cube_{r+1}, ..., and Cube_{n-1} are not utilized for any data
transfer. So, each group of 2^r PEs has available to it Cube_{r-1}, ...,
Cube_1, and Cube_0. Thus, single and multiple control partitioning are
possible for groups of size 2^r.

The above argument can be extended to group PEs that have any set
of n-r bits in common. □

Chapter  IV  introduced  a  circuit  for  a  recirculating  PM2I  network. 
Any  of  the  2n  -  1  distinct  PM2I  interconnection  functions  may  be  chosen 
at  each  use  of  the  network. 

Theorem V.10: A recirculating PM2I network may be partitioned, under
single or multiple control partitioning, into 2^{n-r} groups of 2^r PEs so
that all PEs in a partition have the same n-r least significant bits.

Proof: Theorem V.6 showed how a multistage PM2I network can be
partitioned into groups that have the n-r least significant bits in
common by not utilizing the last n-r stages of the network, that is, the
stages for PM2_{±i}, 0 ≤ i < n-r. In this manner, each group has available
to it 2r interconnection functions, where PM2_{±(n-r+i)} now behaves as
PM2_{±i} for a group, 0 ≤ i < r. No group can send data outside of its
group, since using these 2r functions, the n-r least significant
address bits never change. That is, for all PE addresses P =
p_{n-1}...p_1 p_0,

    PM2_{±j}(P) = p'_{n-1}...p'_j p_{j-1}...p_1 p_0,

j ≥ n-r. So, each group may use the network independently of any
other group. Thus, single and multiple control partitioning are
possible for partitions of size 2^r. □




V.4. Multiple Control Partitioning and Variable Size Partitions

For multiple control partitioning, partitioning the machine into
groups of PEs of equal size has been considered. Some networks can be
partitioned using a less restrictive multiple control partitioning where
the only constraint is that all partitions be of size a power of 2.

Corollary V.7: If a network can be partitioned using restricted multiple
control partitioning, where all partitions are the same size, then it
can also be partitioned such that groups are of different sizes, where
all sizes are powers of two.

Proof: Corollaries V.2, V.3, and V.6 can be applied recursively to break
large partitions into smaller ones. Let the network be partitioned into
2^{n-r} groups of 2^r PEs. For any one of these 2^{n-r} groups, the applicable
corollary may be applied again to break the group into 2^{r-k} groups of 2^k
PEs, 0 < k < r. Figure V.14 shows a generalized cube for N = 8 which is
partitioned into three groups, G0, G1, and G2. G0 and G1 are of size 2.
G2 is of size 4. □

Figure V.14: A generalized cube can be partitioned into groups of
different sizes when all sizes are powers of two. Here, G0 and G1 are of
size 2, and G2 is of size 4.
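
A variable-size partitioning of this kind can be described with PE address masks and checked mechanically. The sketch below is illustrative only: the masks 00X, 01X, and 1XX are an assumed choice of G0, G1, and G2 (the figure's actual groups are not recoverable here), and the helper names are hypothetical. Each group may use only the Cube functions at its don't-care positions.

```python
def expand(mask: str) -> list[int]:
    """All PE addresses matched by a mask of 0, 1 and X (MSB first)."""
    addrs = [0]
    for c in mask:
        if c == "X":
            addrs = [(a << 1) | bit for a in addrs for bit in (0, 1)]
        else:
            addrs = [(a << 1) | int(c) for a in addrs]
    return sorted(addrs)


def allowed_cubes(mask: str) -> list[int]:
    """Indices i of the Cube_i functions a group may use (its X positions)."""
    n = len(mask)
    return [n - 1 - pos for pos, c in enumerate(mask) if c == "X"]


def is_valid_partitioning(masks: list[str]) -> bool:
    """True if the groups are disjoint, cover all PEs, and each group is
    closed under its own allowed Cube functions."""
    n = len(masks[0])
    seen = [p for m in masks for p in expand(m)]
    if sorted(seen) != list(range(1 << n)):
        return False
    return all(
        (p ^ (1 << i)) in expand(m)
        for m in masks
        for p in expand(m)
        for i in allowed_cubes(m)
    )
```

With these assumed masks, G0 = {0, 1} and G1 = {2, 3} use only Cube_0, while G2 = {4, 5, 6, 7} uses Cube_1 and Cube_0.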



y^.£.  Conclusions 

Two ways have been defined to partition an SIMD machine consisting
of $2^n$ processing elements into two to $2^{n-1}$ groups of $2^r$ PEs,
each group acting as an SIMD machine of size $2^r$. One way, single
control partitioning, uses a single control to broadcast instructions to
all groups. Thus, the same SIMD machine program may be executed on
several different data sets simultaneously (e.g., multiplying two pairs
of matrices). The other way, multiple control partitioning, assumes the
availability of additional control units so that each $2^r$ PE submachine
may have its own instruction stream. The data being acted on by
different submachines may be copies of the same data file (e.g.,
processing two copies of the same image to find different features).

Under single control partitioning, the Shuffle-Exchange network,
the STARAN flip network, the generalized cube network, and the indirect
binary n-cube network are readily partitioned into $2^{n-r}$ groups of
$2^r$ PEs. The data manipulator network and the augmented data
manipulator can group certain sets of $2^r$ PEs, but the ability to
connect sets of $2^r$ PEs with consecutive addresses is restricted.

The n stage Shuffle-Exchange network, the generalized cube network,
the indirect binary n-cube network, and the ADM network have control
structures that are general enough to accommodate multiple control
partitioning. The data manipulator, the flip network, the indirect
binary n-cube, and the Shuffle-Exchange network can connect any two
arbitrary PEs in one pass through the network.

A pipelined multistage network can be partitioned in the same
manner as the combinational logic multistage network on which it is
based. Recirculating Cube and PM2I networks can be partitioned in the
same fashion as the multistage Cube and PM2I networks. Table V.3
summarizes the results of Chapter V.

Chapter V has defined the ways in which certain interconnection
networks can be partitioned. This information is essential for
selecting an interconnection network for a partitionable parallel
processing system. In addition, the results of Chapter V are relevant
to the theory of the types of permutations from input addresses to
output addresses that a given type of interconnection network with a
given control can and cannot accomplish.




Table V.3: Summary of the partitioning properties of the networks
discussed in Chapter V. N = $2^n$ PEs are partitioned into $2^{n-r}$
groups of $2^r$ PEs, 0 < r < n.

NETWORK                        PARTITIONING

Multistage Shuffle-Exchange,   Any set of n-r common PE address bits
omega, generalized cube,       defines a partition for either single or
indirect binary n-cube         multiple control partitioning.

Flip network,                  For single control partitioning, any set
flip control                   of n-r PE address bits defines a
                               partition. The flip network with flip
                               control cannot be partitioned under
                               multiple control partitioning.

Flip network,                  For single control partitioning, the n-r
shift control                  least significant PE address bits or the
                               n-r most significant address bits define
                               a partition. Multiple control
                               partitioning is not possible.

Data manipulator               The n-r least significant PE address bits
                               define a partition.

Augmented data                 The n-r least significant PE address bits
manipulator                    define a partition. Also, the n-r most
                               significant address bits define a
                               partition.

Recirculating cube             Any set of n-r PE address bits defines a
                               partition.

Recirculating Illiac           For N > 2, the Illiac network cannot be
                               partitioned.

Recirculating PM2I             The n-r least significant PE address bits
                               define a partitioning.




VI. THE AUGMENTED DATA MANIPULATOR NETWORK

VI.1. Introduction

The Augmented Data Manipulator network (ADM) is a multistage PM2I
network (Figure III.7) where each switching cell of the network receives
control  signals  independently  of  any  other  cell  of  the  network.  This 
chapter  explores  some  of  the  capabilities  of  the  ADM.  Section  2 
compares  the  types  of  data  transfers  that  the  ADM  can  accomplish  with 
those  which  can  be  done  by  the  generalized  cube  network.  Section  3 
illustrates  how  some  data  transfers  can  be  accomplished  by  more  than  one 
control  setting.  Also  considered  here  are  some  transfers  which  the  ADM 
cannot  perform  in  one  pass  through  the  network.  Section  4  presents  some 
group  theoretic  properties  of  the  ADM  network. 




VI.2. The ADM and the Generalized Cube Network

The data manipulator network (Figure III.7) consists of n stages
with N switching cells per stage. The stages are ordered from n-1 to 0,
where the interconnection functions of stage i are $PM2_{+i}$,
$PM2_{-i}$, and the identity (straight across). The controls of the data
manipulator are limited to one pair per stage. Cells whose i-th address
bit is 0 respond to one control; cells whose i-th bit is 1 respond to
the other control.

Because groups of N/2 cells respond to one control signal, the data
manipulator network cannot simulate the generalized cube network with
individual interchange box control. Since the hardware cost of the data
manipulator is greater than that of the generalized cube with four
function interchange boxes (Chapter IV), there are many cases when the
data manipulator is not a cost effective option.

The ADM is a data manipulator with individual cell control, i.e.,
each cell can receive none, one, two, or three of the signals H
(straight across), U ($PM2_{-i}$), and D ($PM2_{+i}$) [SIS78]. The data
output from cell P at stage i becomes the data input to cell P' at stage
i-1, where $P' \in \{P, (P+2^i) \bmod N, (P-2^i) \bmod N\}$. Each cell
passes data independently of any other cell. The ADM can perform all the
interconnection functions of the generalized cube network with four
function boxes.

Theorem VI.1: The ADM can simulate the generalized cube network with
individual box control [SIS78].

Proof: Consider simulating the four function box at stage i of the
generalized cube whose upper input is x and lower input is y. Let the
subscript x denote the ADM control for cell x at stage i. Then, the
four functions of the interchange box can be simulated as follows.

    straight        - $H_x H_y$,
    exchange        - $D_x U_y$,
    lower broadcast - $U_y H_y$,
    upper broadcast - $H_x D_x$.
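The box simulation in the proof of Theorem VI.1 can be modelled with a small Python sketch. The names `adm_move` and `simulate_box` are hypothetical, and the sketch assumes the upper input x has a 0 in bit i so that the lower input is y = x + 2^i. The upper broadcast entry (H and D at cell x) is inferred by symmetry with the lower broadcast entry.

```python
def adm_move(cell, stage, control, n):
    """Cell reached when `cell` at `stage` uses H (straight), U (-2**stage), or D (+2**stage)."""
    N = 1 << n
    step = {"H": 0, "U": -(1 << stage), "D": 1 << stage}[control]
    return (cell + step) % N

def simulate_box(function, x, stage, n):
    """Return {output cell: source input} for one interchange box built from ADM controls."""
    y = x + (1 << stage)                       # lower input of the box (assumes bit `stage` of x is 0)
    controls = {
        "straight": {x: "H", y: "H"},
        "exchange": {x: "D", y: "U"},
        "lower broadcast": {y: "UH"},          # y feeds both outputs
        "upper broadcast": {x: "HD"},          # x feeds both outputs (inferred by symmetry)
    }[function]
    out = {}
    for cell, signals in controls.items():
        for signal in signals:
            out[adm_move(cell, stage, signal, n)] = cell
    return out

for name in ("straight", "exchange", "lower broadcast", "upper broadcast"):
    print(name, simulate_box(name, x=2, stage=0, n=3))
```

In each case both box outputs (cells x and y) receive data from exactly one source, which is what the interchange box functions require.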


Recall from Chapter II that data transfers from one PE to another
can be represented as a mapping from the set of input PE addresses to
the set of output PE addresses. When this mapping is one-to-one and
onto, the data transfer will be called a permutation of input PE
addresses to output PE addresses. For a given network, if a control
setting exists which accomplishes permutation f(P) of input PE addresses
to output PE addresses, then the network passes the permutation f(P).

A permutation, f(P), from PE addresses to PE addresses can be
written as:

$$\begin{pmatrix} 0 & 1 & 2 & \cdots & N-1 \\ f(0) & f(1) & f(2) & \cdots & f(N-1) \end{pmatrix}$$

where the top line is the input PE address and the bottom line is the
output PE address to which f maps the input. A permutation can
cyclically permute a set of elements $i_0, i_1, \ldots, i_r$ such that

$$f(i_0) = i_1, \; f(i_1) = i_2, \; \ldots, \; f(i_{r-1}) = i_r, \; f(i_r) = i_0.$$

Write this cycle as $(i_0 \; i_1 \; \ldots \; i_r)$. The physical
interpretation of this cycle is that input $i_0$ sends its data to
output $i_1$, input $i_1$ sends its data to output $i_2$, ..., input
$i_{r-1}$ sends its data to output $i_r$, and input $i_r$ sends its data
to output $i_0$. Any permutation can be written as a product of disjoint
cycles. For example, if N = 8 and f(P) is the permutation

$$\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 5 & 4 & 3 & 7 & 1 & 6 & 0 \end{pmatrix}$$

then f(P) can be written as the product of four cycles:

(0 2 4 7)(1 5)(3)(6).

That is, 0 sends data to 2, 2 sends data to 4, 4 sends data to 7, 7
sends data to 0, 1 and 5 exchange data, and 3 and 6 pass straight across
the network.
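The cycle notation can be checked mechanically. The following Python sketch (the function name `cycles` is an assumption, not from the report) decomposes a permutation, stored as a list with `perm[p] = f(p)`, into disjoint cycles and applies it to the N = 8 example, whose bottom row is read off from the cycles listed in the text.

```python
def cycles(perm):
    """Return the disjoint cycles of a permutation given as a list: perm[p] = f(p)."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, p = [], start
        while p not in seen:          # follow start -> f(start) -> ... until the cycle closes
            seen.add(p)
            cycle.append(p)
            p = perm[p]
        result.append(tuple(cycle))
    return result

# 0->2, 1->5, 2->4, 3->3, 4->7, 5->1, 6->6, 7->0
f = [2, 5, 4, 3, 7, 1, 6, 0]
print(cycles(f))   # [(0, 2, 4, 7), (1, 5), (3,), (6,)]
```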

The permutations of input PE addresses to output PE addresses which
the ADM network can accomplish are a superset of those the generalized
cube network of Figure III.6 can perform [SIS78]. In this section, some
classes of permutations which the ADM can perform, but the generalized
cube cannot, are enumerated.

Theorem VI.2: For N ≥ 4, consider an SIMD machine with an ADM network
that is partitioned under multiple control partitioning. Let the PEs be
divided into $2^{n-2}$ groups of $2^2 = 4$ PEs such that the addresses of
the PEs in each group are

$$Q \cdot 2^{n-2} + P, \quad 0 \le Q < 4, \quad \text{for some fixed } P, \; 0 \le P < 2^{n-2}.$$

Then, within each group of four PEs, the ADM can form all 24
permutations of four items.

Proof: Let a control $C_P^i$ affect cell P at stage i. If C = H, then the
connection is straight across. If C = U, then data is moved from cell P
at stage i to cell $PM2_{-i}(P)$ at stage i-1. If C = D, then data moves
from cell P at stage i to cell $PM2_{+i}(P)$ at stage i-1. Table VI.1
lists the 24 possible permutations of 4 PEs along with the ADM controls
which produce these permutations for a 2-stage ADM network. □
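The claim of Theorem VI.2 can be cross-checked by exhaustive search for a stand-alone 2-stage ADM with N = 4. The Python sketch below is an illustrative model only: the function names are hypothetical, and a control setting is treated as legal when no two cells of a stage send data to the same cell.

```python
from itertools import product

def stage_maps(n, stage):
    """Conflict-free mappings of one ADM stage, each a tuple m with m[cell] = next cell."""
    N = 1 << n
    steps = (0, 1 << stage, -(1 << stage))           # H, D, U
    maps = set()
    for choice in product(steps, repeat=N):           # one control per cell
        m = tuple((cell + s) % N for cell, s in enumerate(choice))
        if len(set(m)) == N:                          # no two cells target the same cell
            maps.add(m)
    return maps

def adm_permutations(n):
    """Permutations an n-stage ADM passes, traversing stages n-1 down to 0."""
    perms = {tuple(range(1 << n))}
    for stage in range(n - 1, -1, -1):
        maps = stage_maps(n, stage)
        perms = {tuple(m[cell] for cell in f) for f in perms for m in maps}
    return perms

print(len(adm_permutations(2)))   # 24
```

Under this model every one of the 4! = 24 permutations of four PE addresses is reached by at least one pair of stage settings.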

Table VI.1: For N = 4, the ADM forms of all 24 possible permutations
of 4 PE addresses.

PERMUTATION       CONTROLS, STAGE 1             CONTROLS, STAGE 0

(0) (1) (2) (3)   $H_0^1 H_1^1 H_2^1 H_3^1$     $H_0^0 H_1^0 H_2^0 H_3^0$
(0) (1) (2 3)     $H_0^1 H_1^1 H_2^1 H_3^1$     $H_0^0 H_1^0 D_2^0 U_3^0$
(0) (1 2) (3)     $H_0^1 H_1^1 H_2^1 H_3^1$     $H_0^0 D_1^0 U_2^0 H_3^0$
(0 1) (2) (3)     $H_0^1 H_1^1 H_2^1 H_3^1$     $D_0^0 U_1^0 H_2^0 H_3^0$
(0 2) (1) (3)     $D_0^1 H_1^1 U_2^1 H_3^1$     $H_0^0 H_1^0 H_2^0 H_3^0$
(0 3) (1) (2)     $H_0^1 H_1^1 H_2^1 H_3^1$     $U_0^0 H_1^0 H_2^0 D_3^0$
(0) (2) (1 3)     $H_0^1 D_1^1 H_2^1 U_3^1$     $H_0^0 H_1^0 H_2^0 H_3^0$
(0 1) (2 3)       $D_0^1 D_1^1 U_2^1 U_3^1$     $U_0^0 D_1^0 U_2^0 D_3^0$
(0 3) (1 2)       $H_0^1 H_1^1 H_2^1 H_3^1$     $U_0^0 D_1^0 U_2^0 D_3^0$
(0 2) (1 3)       $D_0^1 D_1^1 U_2^1 U_3^1$     $H_0^0 H_1^0 H_2^0 H_3^0$
(0 1 2) (3)       $D_0^1 H_1^1 U_2^1 H_3^1$     $H_0^0 D_1^0 U_2^0 H_3^0$
(0 2 3) (1)       $D_0^1 H_1^1 U_2^1 H_3^1$     $U_0^0 H_1^0 H_2^0 D_3^0$
(0 3 2) (1)       $D_0^1 H_1^1 U_2^1 H_3^1$     $H_0^0 H_1^0 D_2^0 U_3^0$
(0 2 1) (3)       $D_0^1 H_1^1 U_2^1 H_3^1$     $D_0^0 U_1^0 H_2^0 H_3^0$
(0 1 3) (2)       $H_0^1 D_1^1 H_2^1 U_3^1$     $D_0^0 U_1^0 H_2^0 H_3^0$
(0 3 1) (2)       $H_0^1 D_1^1 H_2^1 U_3^1$     $U_0^0 H_1^0 H_2^0 D_3^0$
(0) (1 2 3)       $H_0^1 D_1^1 H_2^1 U_3^1$     $H_0^0 H_1^0 D_2^0 U_3^0$
(0) (1 3 2)       $H_0^1 D_1^1 H_2^1 U_3^1$     $H_0^0 D_1^0 U_2^0 H_3^0$
(0 3 2 1)         $H_0^1 H_1^1 H_2^1 H_3^1$     $U_0^0 U_1^0 U_2^0 U_3^0$
(0 3 1 2)         $D_0^1 D_1^1 U_2^1 U_3^1$     $H_0^0 H_1^0 D_2^0 U_3^0$
(0 2 3 1)         $D_0^1 D_1^1 U_2^1 U_3^1$     $U_0^0 H_1^0 H_2^0 D_3^0$
(0 2 1 3)         $D_0^1 D_1^1 U_2^1 U_3^1$     $D_0^0 U_1^0 H_2^0 H_3^0$
(0 1 3 2)         $D_0^1 D_1^1 U_2^1 U_3^1$     $H_0^0 D_1^0 U_2^0 H_3^0$
(0 1 2 3)         $H_0^1 H_1^1 H_2^1 H_3^1$     $D_0^0 D_1^0 D_2^0 D_3^0$

Theorem VI.3: The generalized cube network cannot perform all
permutations of four PEs.

Proof: Let the addresses of the four PEs be 0, 1, 2, and 3. Then eight
permutations cannot be passed by a 2-stage generalized cube network.
These eight are expressed as cyclic permutations, as follows.

(0) (1 2) (3)
(0) (1 3 2)
(0 1 2) (3)
(0 1 3 2)
(0 2 3 1)
(0 2 3) (1)
(0 3 1) (2)
(0 3) (1) (2).

For example, the permutation (0) (1 2) (3) cannot be performed by the
generalized cube network. For N = 4, the data from PE(0) and from PE(3)
must pass straight across both stages 1 and 0 of the 2-stage network.
So, the controls for the two interchange boxes which affect PE(0) must
be set to no exchange, and the controls for the two interchange boxes
which affect PE(3) must be set to no exchange. If PE(1) and PE(2)
exchange data, then the path that the data from PE(1) takes should be
from PE(001) to position 011 at the output of stage 1 to PE(010) at the
output of stage 0. Since data from PE(3) travels straight across the
network, a conflict occurs between PE(1) and PE(3) at stage 1. □
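The list of eight blocked permutations in Theorem VI.3 can be reproduced by enumerating every setting of a 2-stage generalized cube. The Python sketch below is a simplified model under stated assumptions: the function name is hypothetical, the box at stage i pairs the two cells whose addresses differ only in bit i, and each box is either straight or exchange (broadcasts are not permutations).

```python
from itertools import permutations, product

def cube_permutations(n):
    """Permutations passed by an n-stage generalized cube, stages n-1 down to 0."""
    N = 1 << n
    perms = {tuple(range(N))}
    for stage in range(n - 1, -1, -1):
        bit = 1 << stage
        uppers = [c for c in range(N) if not c & bit]     # upper input of each box
        maps = []
        for setting in product((False, True), repeat=len(uppers)):
            m = list(range(N))
            for x, swap in zip(uppers, setting):
                if swap:                                  # box set to exchange
                    m[x], m[x | bit] = x | bit, x
            maps.append(m)
        perms = {tuple(m[c] for c in f) for f in perms for m in maps}
    return perms

passable = cube_permutations(2)
blocked = sorted(set(permutations(range(4))) - passable)
print(len(passable), len(blocked))   # 16 8
for f in blocked:
    print(f)                         # f[p] is the destination of PE p
```

The tuple `(0, 2, 1, 3)`, which is the exchange (0) (1 2) (3) used as the example in the proof, appears in `blocked`.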

Theorems VI.2 and VI.3 indicate that even for small values of N,
the ADM is a more complex network than the generalized cube. Some
classes of permutations which can be performed by the ADM but not by the
generalized cube can be defined.

Theorem VI.4: Let P and Q be two PEs such that, for some i, 0 ≤ i < n-1,
$Q = P \pm 2^i$, and the addresses of P and Q differ in two or more bit
positions. Then, the ADM can pass the permutation

(0) (1) ... (P Q) ... (N-1).

The generalized cube cannot pass this permutation.

Proof: For $Q = P + 2^i$, set all stages except stage i to "straight." At
stage i, P employs $PM2_{+i}$ and Q employs $PM2_{-i}$. That is, P and Q
exchange data at stage i, and so no conflicts can occur in passing this
permutation. For example, if N = 16, i = 2, P = 0100, and Q = 1000,
then $PM2_{+2}(P) = Q$ and $PM2_{-2}(Q) = P$.

For $Q = P - 2^i$, set all stages except stage i to "straight." At
stage i, P employs $PM2_{-i}$ and Q employs $PM2_{+i}$. So, P and Q
exchange data at stage i, and no conflicts result in passing the data
through the network.

The generalized cube cannot perform this type of permutation. P and
Q differ in two or more bit positions, and so more than one stage of the
network must execute a cube function. If this happened, then data from
PEs other than P and Q would be affected. □
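A concrete instance of Theorem VI.4 can be checked by brute force for N = 8, P = 011, Q = 100, i = 0. The Python sketch below is illustrative only: the function names are hypothetical, and the generalized cube is modelled with straight or exchange boxes, where the box at stage i pairs the cells that differ only in bit i.

```python
from itertools import product

def cube_permutations(n):
    """Permutations passed by an n-stage generalized cube (boxes straight or exchange)."""
    N = 1 << n
    perms = {tuple(range(N))}
    for stage in range(n - 1, -1, -1):
        bit = 1 << stage
        uppers = [c for c in range(N) if not c & bit]
        maps = []
        for setting in product((False, True), repeat=len(uppers)):
            m = list(range(N))
            for x, swap in zip(uppers, setting):
                if swap:
                    m[x], m[x | bit] = x | bit, x
            maps.append(m)
        perms = {tuple(m[c] for c in f) for f in perms for m in maps}
    return perms

def adm_exchange(n, p, i):
    """Permutation produced when cell P uses D and cell Q = P + 2**i uses U at stage i."""
    q = p + (1 << i)
    assert q < (1 << n) and bin(p ^ q).count("1") >= 2, "hypothesis of the theorem"
    out = list(range(1 << n))
    out[p], out[q] = q, p            # all other cells, and all other stages, are straight
    return tuple(out)

swap_3_4 = adm_exchange(3, 3, 0)
print(swap_3_4 in cube_permutations(3))   # False
```

The exchange is a single-stage ADM setting, yet it is absent from all 4096 settings of the 3-stage cube model.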




In Chapter IV, Section IV.3 showed that a multistage Shuffle-
Exchange network cannot perform a perfect shuffle (Figure II.6a) in one
pass through the network. Since the generalized cube network is
topologically equivalent to a multistage Shuffle-Exchange network, the
generalized cube cannot perform a perfect shuffle in one pass through
the network. However, the ADM can.

Theorem VI.5: The ADM can perform a perfect shuffle in one pass through
the network.

Proof: The control settings for stage i, n > i ≥ 0, are determined as
follows. The address of a cell P at stage i is $P = p_{n-1} \ldots p_1 p_0$.

    set stage n-1 to straight across;
    for i = n-2 step -1 until 0
        if $p_{i+1} \ne p_i$ then
            if $p_{i+1} = 0$
                then set cell P at stage i to $PM2_{+i}$;
            else if $p_{i+1} = 1$
                then set cell P at stage i to $PM2_{-i}$;
        else set cell P at stage i to straight across;

For the controls calculated from the algorithm, after stage i, data
originally from PE $P = p_{n-1} \ldots p_1 p_0$ is at the output of cell
$P' = p_{n-2} p_{n-3} \ldots p_i \, p_{n-1} \, p_{i-1} \ldots p_0$. So, after
passing through the ADM network using the settings calculated above,
data from input P is moved to output PE(P') where
$P' = p_{n-2} \ldots p_0 p_{n-1}$. Figure VI.1 shows the data movement
through an ADM for N = 8 for the perfect shuffle permutation.

At stage i of the network, data from input cell
$P = p_{n-2} p_{n-3} \ldots p_{i+1} \, p_{n-1} \, p_i \, p_{i-1} \ldots p_0$ is
moved by the algorithm to output cell
$P' = p_{n-2} \ldots p_{i+1} \, p_i \, p_{n-1} \, p_{i-1} \ldots p_0$. For all
cells that have addresses where $p_i = p_{n-1}$, the "straight across"
connection is used. Thus, P = P' and no conflicts of data can occur for
these cells of the network. If $p_i \ne p_{n-1}$, then two cases are
considered.

(1) $P = p_{n-2} \ldots p_{i+1} \, 0 \, 1 \, p_{i-1} \ldots p_0$ moves data to
    output cell $P' = p_{n-2} \ldots p_{i+1} \, 1 \, 0 \, p_{i-1} \ldots p_0$.
    The only input cell position at stage i that could conflict with
    the data from P is cell
    $Q = p_{n-2} \ldots p_{i+1} \, 1 \, 0 \, p_{i-1} \ldots p_0 = P'$. The
    algorithm moves this data to output cell P at the input to stage
    i-1. So, data from P and data from Q cannot conflict at stage i.

(2) $P = p_{n-2} \ldots p_{i+1} \, 1 \, 0 \, p_{i-1} \ldots p_0$ is moved by the
    algorithm to output cell
    $P' = p_{n-2} \ldots p_{i+1} \, 0 \, 1 \, p_{i-1} \ldots p_0$. This case is
    analogous to case (1). There is no input cell position of the
    network that the algorithm allows to conflict with P.

Since no data can conflict with any other data when it moves in a
straight across path, when it moves in a $PM2_{+i}$ path, or when it
moves in a $PM2_{-i}$ path, the algorithm generates control settings
which allow conflict free passage of the data through the network. □
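The control algorithm in the proof of Theorem VI.5 translates directly into a short simulation. The Python sketch below is a minimal model with hypothetical function names: it tracks the cell holding each input's data, applies the rule "compare bits i+1 and i of the current cell address" at each stage, and checks that no two data items ever land in the same cell.

```python
def shuffle(p, n):
    """Perfect shuffle: rotate the n-bit address left by one position."""
    return ((p << 1) | (p >> (n - 1))) & ((1 << n) - 1)

def adm_perfect_shuffle(n):
    """Route all N = 2**n inputs through the ADM; return the final cell of each input."""
    N = 1 << n
    position = list(range(N))              # stage n-1 is set to straight across
    for i in range(n - 2, -1, -1):
        moved = []
        for cell in position:
            hi = (cell >> (i + 1)) & 1     # bit i+1 of the cell address
            lo = (cell >> i) & 1           # bit i of the cell address
            if hi != lo:
                step = (1 << i) if hi == 0 else -(1 << i)   # PM2_{+i} or PM2_{-i}
                cell = (cell + step) % N
            moved.append(cell)
        assert len(set(moved)) == N, "two data items reached the same cell"
        position = moved
    return position

n = 3
print(adm_perfect_shuffle(n) == [shuffle(p, n) for p in range(1 << n)])   # True
```

Each stage swaps bits i+1 and i of the cell address, so the original most significant bit migrates down to position 0, which is exactly the perfect shuffle.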




VI.3. Capabilities and Limitations of the ADM

For some data transfers, more than one control setting may exist
for the ADM network. This is not true for the generalized cube network,
where stage i and only stage i can change bit $p_i$ as the input address
is mapped to the output address.

For example, consider the number of legal control settings of the
ADM, those for which no conflicts occur between data passing through the
network. As an illustration, consider an ADM network for N = 8 (Figure
III.7). Stage 2 performs $PM2_{\pm 2}$. Since $PM2_{+2} = PM2_{-2}$,
stage 2 can perform $Cube_2$. That is, data can be conditionally
exchanged by cell positions whose addresses differ in bit $p_2$. There
are four such pairs of cells at stage 2. So, the number of legal control
settings is $2^{8/2} = 16$.

Stage 1 can perform $PM2_{\pm 1}$. From Corollary V.6, two independent
groups of PEs pass data among themselves at stage 1. These are the even
numbered PEs, 0, 2, 4, and 6, and the odd numbered PEs, 1, 3, 5, and 7.
Since the two groups are independent, the number of legal control
settings for this stage is the square of the number for one of the
groups. Consider the group of even numbered PEs. There are nine legal
control settings. If the four PEs are grouped to conditionally exchange
data between PEs $p_2 00$ and PEs $p_2 10$ (Figure VI.2), as in the cube
network, then $2^2 = 4$ legal control settings exist. If the PEs are
grouped to conditionally exchange data between PEs $p_2 00$ and PEs
$\bar{p}_2 10$ (Figure VI.3), then $2^2 = 4$ legal control settings
exist. Both of these groups contain the identity (straight across)
setting, and so the two groupings exhibit $2 \cdot 2^2 - 1 = 7$ distinct
legal control settings. One setting exists which maps P to P', where
P' = (P + 2) modulo N, and one setting exists which maps P to P', where
P' = (P - 2) modulo N. These 7 + 1 + 1 = 9 control settings are the only
legal settings for the four even PEs at stage 1. Any other settings
cause conflicts to occur. So, the total number of legal control settings
at stage 1 is $9^2 = 81$.

Figure VI.3: Stage 1 of the ADM can conditionally exchange data between
PEs P and (P+2) modulo N for P = X10.

The number of legal control settings for stage 0 can be similarly
counted. Data is moved among all 8 PEs at this stage. Data can be
conditionally exchanged between PEs P and P+1 for P even to yield
$2^4 = 16$ legal control settings. Data can be conditionally exchanged
between PEs P and P+1, for P odd, to yield $2^4 = 16$ control settings.
These two cases give $2 \cdot 16 - 1 = 31$ distinct legal control
settings. The final two cases move data from PE(P) to PE(P') for
P' = (P + 1) modulo N, and from PE(P) to PE(P') for P' = (P - 1) modulo
N. The number of distinct legal control settings for stage 0 is
31 + 1 + 1 = 33.

The total number of legal control settings for a 3-stage ADM
network is (16)*(81)*(33) = 42,768. But, the number of permutations of
eight items is 8! = 40,320. So, there must be some permutations that
can be performed using more than one control setting; some control
settings must be redundant. The next theorem demonstrates one form of
these redundant settings.
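A brute-force enumeration is one way to cross-check the per-stage counts for N = 8. The Python sketch below is a model under stated assumptions: the function name is hypothetical, a setting is legal when no two cells of a stage send data to the same cell, and two settings count as the same when they move data identically (so D and U coincide at stage 2). Note that this enumeration also admits settings that mix exchanges of both alignments within one stage, for instance 0 and 1 exchanging while 3 and 4 exchange at stage 0, which the tally in the text does not list. Its stage 0 figure is therefore larger than 33, and the pigeonhole conclusion about redundant settings is unaffected.

```python
from itertools import product
from math import factorial

def legal_stage_mappings(n, stage):
    """Distinct conflict-free data movements that one ADM stage can realize."""
    N = 1 << n
    steps = (0, 1 << stage, -(1 << stage))            # H, D, U
    found = set()
    for choice in product(steps, repeat=N):            # one control per cell
        m = tuple((cell + s) % N for cell, s in enumerate(choice))
        if len(set(m)) == N:                           # every cell receives exactly one item
            found.add(m)
    return found

counts = [len(legal_stage_mappings(3, stage)) for stage in (2, 1, 0)]
total = counts[0] * counts[1] * counts[2]
print(counts)                      # [16, 81, 49]
print(total, factorial(8))         # 63504 40320
```

Since the number of setting combinations exceeds 8!, some permutations must be reachable by more than one control setting.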

Theorem VI.6: Consider the interconnection function $f(x) = Cube_i(x)$,
for all i, 0 ≤ i < n-1, for all x, 0 ≤ x < N. Then, there are n-i
different control settings for the ADM which realize f(x).

Proof: $Cube_i$ exchanges data between
$P = p_{n-1} \ldots p_{i+1} \, 0 \, p_{i-1} \ldots p_0$ and
$P' = p_{n-1} \ldots p_{i+1} \, 1 \, p_{i-1} \ldots p_0$. So,
$P = (P' - 2^i)$ modulo N, and $P' = (P + 2^i)$ modulo N. $Cube_i$ can be
realized by setting the ADM controls such that at stage i, cells whose
i-th address bit equals 0 perform $PM2_{+i}$, while those whose i-th
address bit is 1 perform $PM2_{-i}$. Since $2^i = 2^{i+1} - 2^i$, an
equivalent control setting is for PEs P to execute $PM2_{+(i+1)}$ and
$PM2_{-i}$, and for PEs P' to execute $PM2_{-(i+1)}$ and $PM2_{+i}$.
Since

$$\begin{aligned}
2^i &= 2^{i+1} - 2^i \\
    &= 2^{i+2} - 2^{i+1} - 2^i \\
    &= 2^k - 2^{k-1} - \cdots - 2^i, \quad n > k > i,
\end{aligned}$$

there are n-i different settings for the ADM which accomplish $Cube_i$.
Since all PEs with $p_i = 0$ execute $PM2_{+j}$ ($PM2_{-j}$), while PEs
with $p_i = 1$ execute $PM2_{-j}$ ($PM2_{+j}$), no collisions occur in
passing data through the network. □
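The n-i settings of Theorem VI.6 can be simulated directly. In the Python sketch below (the function name and the parameter k are assumptions used for illustration), the setting indexed by k uses the identity $2^i = 2^k - 2^{k-1} - \cdots - 2^i$: cells with bit i equal to 0 apply +2^k at stage k and then -2^j at stages k-1 down to i, and cells with bit i equal to 1 apply the opposite signs. The case k = i is the direct single-stage setting.

```python
def cube_via_adm(n, i, k):
    """Route all inputs through the ADM using the k-th redundant setting for Cube_i."""
    N = 1 << n
    digits = {k: +1}                                   # +2**k at stage k ...
    digits.update({j: -1 for j in range(i, k)})        # ... then -2**j for k > j >= i
    position = list(range(N))
    for stage in range(n - 1, -1, -1):
        d = digits.get(stage, 0)                       # 0 means straight across
        moved = []
        for cell in position:
            sign = d if (cell >> i) & 1 == 0 else -d   # opposite signs for bit i = 1
            moved.append((cell + sign * (1 << stage)) % N)
        assert len(set(moved)) == N, "collision between data items"
        position = moved
    return position

n, i = 4, 1
settings = [cube_via_adm(n, i, k) for k in range(i, n)]
expected = [x ^ (1 << i) for x in range(1 << n)]
print(len(settings), all(s == expected for s in settings))   # 3 True
```

For n = 4 and i = 1 the three settings (k = 1, 2, 3) all realize $Cube_1$ without collisions, matching the count n-i.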


Consider the uniform shift permutations that send data from PE(P)
to PE(P') where P' = (P + A) modulo N, 0 < A < N, for all PEs. The
generalized cube network can form this permutation of PE addresses in
one pass through the network, due to [LAW75] and the topological
equivalence of the multistage Shuffle-Exchange network and the
generalized cube [SIS78]. However, only one distinct control setting
exists for a given uniform shift permutation. The ADM has redundant
settings to realize this permutation. Let the binary representation of
A be $a_{n-1} \ldots a_1 a_0$.
n-1  1  □ 

Theorem VI.7: The ADM has redundant control settings for all uniform
shifts of A, 0 < A < N.

Proof:
Case 1: 0 ≤ A < N/2. If $a_i \in \{0, 1\}$, then the ADM can be set
    according to the following.

        At stage i, if $a_i = 0$, then set the network to straight
        across. If $a_i = 1$, then set the network to $PM2_{+i}$.

    Let A be expressed in signed digit notation, where
    $a'_i \in \{0, +1, -1\}$. So, A is expressed as the sum and
    difference of powers of 2. For example, for N = 16, A = 0111 can
    also be expressed as A = 100(-1) = 8 - 1, as A = 10(-1)1 =
    8 - 2 + 1, and as A = 1(-1)11 = 8 - 4 + 2 + 1. Thus, the following
    are all equivalent representations of A, for N ≥ 8, where "1...11"
    indicates a string of one or more 1's.

        $a'_{n-1} \ldots a'_k \; 0 \, 1 \ldots 1 \, 1 \; 0 \, a'_j \ldots a'_0$
        $a'_{n-1} \ldots a'_k \; 1 \, 0 \ldots 0 \, (-1) \; 0 \, a'_j \ldots a'_0$
        $a'_{n-1} \ldots a'_k \; 1 \, 0 \ldots (-1) \, 1 \; 0 \, a'_j \ldots a'_0$
        $a'_{n-1} \ldots a'_k \; 1 \, (-1) \, 1 \ldots 1 \; 0 \, a'_j \ldots a'_0$

    where n > k > j ≥ 0. Each of these different representations of A
    can be used to yield control settings for the ADM network according
    to the following algorithm.

        At stage i, if $a'_i = 0$, then set stage i to straight
        across. If $a'_i = 1$, then set stage i to $PM2_{+i}$. If
        $a'_i = -1$, then set stage i to $PM2_{-i}$.

Case 2: N/2 ≤ A < N. The corresponding uniform shift permutation is
    the same as the permutation f(P) = (P - B) modulo N, for
    0 ≤ B < N/2. For this case, there are settings of the network
    equivalent to that of case 1. For example, for N = 16, B = 0011
    = 010(-1) = 1(-1)0(-1). Thus, case 2 can be proven in a manner
    analogous to case 1. □
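The signed digit argument can be illustrated by enumerating every digit string over {-1, 0, +1} whose weighted sum equals A modulo N and applying each one as a stage-by-stage uniform ADM setting. This Python sketch uses hypothetical function names and lists digits from stage n-1 down to stage 0.

```python
from itertools import product

def signed_digit_forms(a, n):
    """Digit tuples (a'_{n-1}, ..., a'_0) over {-1, 0, +1} with weighted sum = a mod 2**n."""
    N = 1 << n
    return [d for d in product((-1, 0, 1), repeat=n)
            if sum(x * (1 << (n - 1 - k)) for k, x in enumerate(d)) % N == a % N]

def uniform_shift(digits, n):
    """Apply one signed digit form: every cell at stage i moves by a'_i * 2**i."""
    N = 1 << n
    position = list(range(N))
    for k, digit in enumerate(digits):                 # stage n-1 first
        stage = n - 1 - k
        position = [(cell + digit * (1 << stage)) % N for cell in position]
    return position

n, A = 4, 7
forms = signed_digit_forms(A, n)
shift = [(p + A) % (1 << n) for p in range(1 << n)]
print(all(uniform_shift(d, n) == shift for d in forms))   # True
print((0, 1, 1, 1) in forms, (1, 0, 0, -1) in forms)      # True True
```

Because every cell of a stage receives the same control, each stage is itself a uniform shift and no conflicts can arise. Each distinct digit string is a distinct control setting for the same permutation.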

If the stages of the ADM network are traversed in reverse order,
that is, the interconnection functions of the first stage are
$PM2_{\pm 0}$, the interconnection functions of the i-th stage are
$PM2_{\pm i}$, and the interconnection functions of the last stage are
$PM2_{\pm(n-1)}$, the resulting network is called the Inverse Augmented
Data Manipulator (IADM). Figure VI.4 shows the IADM for N = 8. Stage i
of both the ADM and IADM performs $PM2_{\pm i}$, but the first stage of
the ADM is labeled stage n-1 while the first stage of the IADM is
labeled stage 0.

Lemma VI.1: The ADM passes the permutation f if and only if the IADM
passes the inverse permutation, $f^{-1}$.

Proof: Suppose the ADM performs the mapping from P to

$$f(P) = (P + I_{n-1} 2^{n-1} + I_{n-2} 2^{n-2} + \cdots + I_0 2^0) \bmod N,$$

$I_j \in \{0, +1, -1\}$, 0 ≤ j < n, for all P, 0 ≤ P < N. Each input
PE(P) has its own set of $I_j$ to specify its path through the network.
At stage j, if $I_j = 0$ for a given cell, then that cell sends data
straight across. If $I_j = +1$, then cell Q sends data to $Q + 2^j$. If
$I_j = -1$, then cell Q sends data to $Q - 2^j$.

Recall that the first stage of the IADM is labelled stage 0, while
the first stage of the ADM is labelled stage n-1. Set the stages of the
IADM as follows. If input cell Q at stage i of the ADM is set to
$PM2_{+i}$ ($PM2_{-i}$), then set input cell $Q + 2^i$ ($Q - 2^i$) at
stage i of the IADM to $PM2_{-i}$ ($PM2_{+i}$). In traversing the ADM,
if data passes from input cell Q to output cell $Q + 2^i$, then in
traversing the IADM, the data passes from input cell $Q + 2^i$ to output
cell Q. If cell Q at stage i of the ADM is set to straight, then so is
cell Q of stage i of the IADM. The data transfer through the IADM
reverses the effects of the transfer through the ADM. The IADM can thus
perform

$$(f(P) - I_{n-1} 2^{n-1} - I_{n-2} 2^{n-2} - \cdots - I_0 2^0) \bmod N = P.$$

The translation from the IADM to the ADM is analogous. □
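Lemma VI.1 can be cross-checked exhaustively for N = 8. The Python sketch below is a model under stated assumptions: the function names are hypothetical, a stage setting is legal when no two cells send data to the same cell, and the IADM is modelled as the same stages traversed in the order 0, 1, ..., n-1.

```python
from itertools import product

def stage_maps(n, stage):
    """Conflict-free mappings of one stage, each a tuple m with m[cell] = next cell."""
    N = 1 << n
    steps = (0, 1 << stage, -(1 << stage))            # H, D, U
    maps = set()
    for choice in product(steps, repeat=N):
        m = tuple((cell + s) % N for cell, s in enumerate(choice))
        if len(set(m)) == N:
            maps.add(m)
    return maps

def passable(n, stage_order):
    """Permutations passed when the stages are traversed in the given order."""
    perms = {tuple(range(1 << n))}
    for stage in stage_order:
        maps = stage_maps(n, stage)
        perms = {tuple(m[cell] for cell in f) for f in perms for m in maps}
    return perms

def inverse(f):
    """Inverse permutation of f, where f[src] = dst."""
    inv = [0] * len(f)
    for src, dst in enumerate(f):
        inv[dst] = src
    return tuple(inv)

n = 3
adm = passable(n, range(n - 1, -1, -1))    # ADM: stages n-1 down to 0
iadm = passable(n, range(n))               # IADM: stages 0 up to n-1
print({inverse(f) for f in adm} == iadm)   # True
```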


Lemma  VI. 1  can  be  used  to  show  that  the  ADM  cannot  perform  some 
permutations  of  input  addresses  to  output  addresses. 


Theorem VI.8: The IADM cannot perform a perfect shuffle in one pass
through the network.

Proof: Recall that the perfect shuffle is defined as

$$Shuffle(p_{n-1} \ldots p_1 p_0) = p_{n-2} \ldots p_0 p_{n-1}.$$

Consider three consecutive PEs, PE(P), PE(P+1 modulo N), and PE(P-1
modulo N). There exists at least one P such that

    $P = p_{n-1} \ldots p_1 p_0$, where $p_{n-1} \ne p_0$,
    $P' = P+1 = p'_{n-1} \ldots p'_1 p'_0$, where $p'_{n-1} = p'_0$, and
    $P'' = P-1 = p''_{n-1} \ldots p''_1 p''_0$, where $p''_{n-1} = p''_0$.

For example, for N = 16, one such set is P = 0011, P+1 = 0100, and P-1 =
0010. In general, one such P is $0^{n-2}11$. The difference of the
addresses P and Shuffle(P), |P - Shuffle(P)|, is an odd number if and
only if $p_{n-1} \ne p_0$. Since no combination of $PM2_{+i}$ and
$PM2_{-i}$, 0 < i < n, yields an odd number, data from P must use
$PM2_{\pm 0}$ at stage 0. But, data from P' = P+1 modulo N and from
P'' = P-1 modulo N must use the straight across connection at stage 0.
Since $PM2_{+0}(P)$ = (P + 1) modulo N = P', and $PM2_{-0}(P)$ =
(P - 1) modulo N = P'', a conflict occurs at stage 0 between P and P' or
between P and P''. Therefore, the IADM cannot perform a perfect shuffle
in one pass through the network. □


Corollary VI.1: The ADM cannot perform an inverse perfect shuffle in one
pass through the network.

Proof: Follows from Lemma VI.1 and Theorem VI.8. □
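The parity argument behind Theorem VI.8 can be explored with a few lines of Python. The sketch below (function names are assumptions) lists the addresses P whose shuffle destination is an odd distance away while both neighbours P+1 and P-1 are an even distance from theirs, which is the situation that forces the stage 0 conflict in the IADM.

```python
def shuffle(p, n):
    """Perfect shuffle: rotate the n-bit address left by one position."""
    return ((p << 1) | (p >> (n - 1))) & ((1 << n) - 1)

def blocking_addresses(n):
    """Addresses P that must move at IADM stage 0 while both neighbours must stay put."""
    N = 1 << n

    def odd(p):
        return (p - shuffle(p, n)) % 2 == 1    # odd distance to the shuffle destination

    return [p for p in range(N)
            if odd(p) and not odd((p + 1) % N) and not odd((p - 1) % N)]

print(blocking_addresses(4))
```

For N = 16 the list includes P = 0011 (decimal 3), the example used in the proof.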


A bit reversal algorithm transfers data from PE
$P = p_{n-1} \ldots p_1 p_0$ to PE $P' = p_0 p_1 \ldots p_{n-1}$. Data is
sent to the PE whose address is the reverse of the current location.
This permutation is useful in the computation of fast Fourier
transforms [SS79].


Theorem VI.9: For N > 8, the IADM cannot perform a bit reversal in one
pass through the network.

Proof: Assume that all PEs whose addresses in binary are palindromes,
i.e., they are the same read from the left or from the right, such as
00100 and 01010, do not pass data. This frees these positions in the
network to be used by the remaining positions. An argument based on
eliminating the palindromes will give the most optimistic results.

Consider three consecutive binary PE addresses, P, P + 1, and P - 1
such that none of the three are palindromes.
$P = p_{n-1} \ldots p_1 p_0$, where $p_{n-1} \ne p_0$,
$P + 1 = p'_{n-1} \ldots p'_1 p'_0$, where $p'_{n-1} = p'_0$, and
$P - 1 = p''_{n-1} \ldots p''_1 p''_0$, where $p''_{n-1} = p''_0$. For
example, P could be $0^{n-2}11$.

A conflict occurs as follows. The distance between P and its bit
reversal, Pr, |P - Pr|, is an odd number. Since no combination of
$PM2_{+i}$ and $PM2_{-i}$, 0 < i < n, yields an odd number, $PM2_{+0}$
or $PM2_{-0}$ must be applied to P. Thus, $loc_0(P)$ = P + 1 or P - 1.
P + 1 and P - 1 are an even distance from the positions of their bit
reversals. Thus, neither $PM2_{+0}$ nor $PM2_{-0}$ can be applied to
data from P + 1 or P - 1. So, $loc_0(P + 1)$ = P + 1, and
$loc_0(P - 1)$ = P - 1. At stage 0, a conflict occurs between P and
P + 1 or between P and P - 1.

For N = 8, beginning with address 000, for each three consecutive
addresses, at least one is a palindrome. The conflict situation above
cannot occur. □

Corollary VI.2: The ADM cannot perform a bit reversal in one pass
through the network.

Proof: Follows from Lemma VI.1 and Theorem VI.9. □
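The conflict condition used in the proof of Theorem VI.9 can be searched for mechanically. The Python sketch below (function names are hypothetical) looks for addresses P at an odd distance from their bit reversal whose neighbours P+1 and P-1 are non-palindromes at an even distance from theirs.

```python
def bit_reverse(p, n):
    """Reverse the n-bit binary representation of p."""
    return int(format(p, f"0{n}b")[::-1], 2)

def reversal_conflicts(n):
    """Addresses P that force a stage 0 conflict in the IADM under the proof's argument."""
    N = 1 << n

    def palindrome(p):
        return bit_reverse(p, n) == p

    def odd(p):
        return (p - bit_reverse(p, n)) % 2 == 1

    return [p for p in range(N)
            if odd(p)
            and all(not palindrome(q) and not odd(q)
                    for q in ((p + 1) % N, (p - 1) % N))]

print(reversal_conflicts(3))        # []
print(3 in reversal_conflicts(4))   # True
```

The empty result for n = 3 agrees with the remark that the conflict situation cannot occur for N = 8, and P = 0011 appears for N = 16.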




VI. 4.  Some  Theoretical  Results 

Consider  all  mappings  from  the  input  addresses  to  the  output 
addresses  that  are  bijections  (one-to-one  and  onto)^  i.e.,  the  set  of 
all  permutations  of  N  items.  The  ADM  cannot  perform  all  N!  such 
permutations  in  a  single  pass  through  the  network.  The  ADM  can  perform 
some  subset  of  all  permutations  in  one  pass  through  the  network  called 
the  allowable  permutations  of  the  ADM.  Consider  applying  two  or  more 
mappings  to  data  passing  through  the  ADM^  that  is^  making  multiple 
passes  through  the  network,  where  each  pass  realizes  one  of  the 
allowable  permutations  of  the  ADM.  This  is  called  composition  of 
mappings. 

Theorem VI.10: The set of all allowable permutations of the ADM which
can be performed in one pass through the network is not a group under
composition of mappings.

Proof: Let S be the set of all allowable permutations of the ADM. The
perfect shuffle permutation is a member of S, but neither of its
possible inverses, the inverse shuffle or (n-1) perfect shuffles, can be
realized by the ADM in one pass through the network. So, since S does
not contain the inverse of all its members, S is not a group. □
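The proof treats the inverse shuffle and (n-1) perfect shuffles as the same inverse of the perfect shuffle. The following is a minimal sketch of that equivalence, assuming the usual definition of the perfect shuffle as a one-bit left rotation of the n-bit address (a definition this section does not restate). The function names are hypothetical, and the code says nothing about what the ADM itself can route.

```python
def shuffle(p: int, n: int) -> int:
    """Perfect shuffle of an n-bit address: rotate the bits left by one."""
    top = (p >> (n - 1)) & 1
    return ((p << 1) & ((1 << n) - 1)) | top


def inverse_shuffle(p: int, n: int) -> int:
    """Inverse shuffle: rotate the bits right by one."""
    return (p >> 1) | ((p & 1) << (n - 1))


def apply_times(f, times: int, p: int, n: int) -> int:
    """Compose the mapping f with itself `times` times and apply it to p."""
    for _ in range(times):
        p = f(p, n)
    return p


def shuffle_inverse_agrees(n: int) -> bool:
    """Check that (n-1) shuffles equal the inverse shuffle for every address."""
    return all(
        apply_times(shuffle, n - 1, p, n) == inverse_shuffle(p, n)
        and shuffle(inverse_shuffle(p, n), n) == p
        for p in range(1 << n)
    )
```

Calling `shuffle_inverse_agrees(3)` checks all eight addresses of an N = 8 network.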

Consider the set defined by the mappings of Theorem VI.4 and all
compositions of these mappings. These permutations exchange one or more
pairs of PEs, (P_k, Q_k), such that for any pair k, Q_k = P_k ± 2^i,
0 ≤ i < n, 0 ≤ P_k, Q_k < N, and P_k and Q_k differ in two or more bit
positions. Add to this set the identity mapping, and call the resulting
set of permutations T.

Theorem VI.11: The set T for the ADM is a group under composition of
mappings.

Proof: Any member of T exchanges a pair of PEs by applying only one
interconnection function, either PM2_+i or PM2_-i, to that pair. The
inverse mapping is attained by applying PM2_-i or PM2_+i, respectively,
to these same two PEs. So, this inverse mapping is in T.

Let a, b, and c be three mappings in T. Because these mappings
affect pairs of PEs, the order of their application to the data is
irrelevant, that is,

    (a * b) * c = a * (b * c),

where * is composition of mappings. So, T is associative.

Finally, the identity permutation is in T, and so T is a group. □




VI.5. Conclusions

This chapter has pointed out some of the capabilities and
limitations of the ADM. The set of mappings of input PE addresses to
output PE addresses which the ADM can perform is a superset of those of
the generalized cube network. Some classes of permutations have been
presented which can be performed by the ADM but not by the generalized
cube network. The ADM also has multiple control settings for some
mappings, such as the Cube_i functions and the uniform shift functions.
If one path through the network is not operating, then an alternate
route can be selected. So, the network has fault tolerant properties
for some permutations. The ADM network cannot perform all N!
permutations of N input PE addresses to output PE addresses. This
chapter has shown two permutations that it cannot perform in one pass,
the inverse shuffle and the bit reversal permutations.




VII. PARALLEL ALGORITHMS AND INTERCONNECTION NETWORKS


VII.1. Introduction

In designing algorithms for parallel computers, some researchers
assume that no data transfer time penalties are incurred in the
computation [SAM77, HEL77, KST73]. Other investigators have indicated
that this may not be a reasonable assumption [GEN78]. This chapter
investigates the impact of the interconnection network on the
performance of selected parallel algorithms for image processing.

Image processing tasks are well suited to parallel processing
[OS78a, CDL78, SIE78], and many parallel computer architectures have
been proposed for the task [FU78]. Yet, the effects of the
interconnection network on the efficiency of an algorithm have not been
explored. This chapter selects several representative image processing
tasks and formulates a parallel algorithm for each. For each algorithm,
the time required to pass data among PEs is examined for various
interconnection networks. Section 2 defines the picture that will be
used for all algorithms of this chapter. It describes how the picture
is distributed among the N PEs of the parallel computer system. Section
3 presents a parallel smoothing algorithm. Section 4 considers formation
of a histogram of the grey levels of the picture. This histogram can
then be used to threshold the picture at a specific grey level in order
to extract subpictures of the image. Section 5 presents a classification
algorithm which assigns a pixel to one of M classes based on certain
statistics known about each class.




VII.2. Image Processing: Data Distribution


In the algorithms for this chapter, the data image is a black-and-white
512 X 512 pixel (picture element) image with 128 grey levels.
Suppose that 1024 processors are available, and that each processor
stores a 16 X 16 block of the 512 X 512 image. Assume that the 1024 PEs
are logically arranged as an array of 32 X 32 PEs, and that the PE
addresses range from 0 to 1023:

    PE 0      PE 1      . . .     PE 31
    PE 32     PE 33     . . .     PE 63
      .         .                   .
    PE 992              . . .     PE 1023


Assume that the 16 X 16 blocks are stored in row major order. So, PE(0)
stores the pixels of columns 0 to 15 of rows 0 to 15 of the image,
PE(1) stores the pixels of columns 16 to 31 of rows 0 to 15, and so
forth. In general, N = 2^n PEs operate on a picture of 2^k X 2^k pixels,
for k an integer. The PEs are arranged in a 2^(n/2) X 2^(n/2) array,
where n/2 is an integer, such that each PE stores in its memory a block
of pixels of size 2^(k-(n/2)) X 2^(k-(n/2)).

For notational purposes, let each PE consider its 16 X 16 matrix as

    [ h(0,0)   . . .   h(0,14)   h(0,15)  ]
    [    .                          .     ]
    [ h(15,0)  . . .             h(15,15) ]

Also, let the subscripts of h(i,j) extend to -1 and 16, if necessary, in
order to aid in calculations across boundaries of two adjacent blocks in
different PEs. For example, the pixel to the left of h(0,0) is h(0,-1),
and the pixel below h(15,15) is h(16,15). So, -1 ≤ i,j ≤ 16. Let
h(i, j -> k) refer to the set of pixels h(i,j), h(i,j+1), ..., h(i,k-1),
h(i,k).
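The row major block distribution above can be sketched as a small address calculation. This is only an illustration under the stated layout (N = 2^n PEs, a 2^k X 2^k picture); the function name `locate` and its parameter defaults are hypothetical and not part of the original report.

```python
def locate(row: int, col: int, n: int = 10, k: int = 9):
    """Map a global pixel (row, col) to (PE address, local i, local j).

    Assumes N = 2**n PEs in a 2**(n/2) x 2**(n/2) array and a picture of
    2**k x 2**k pixels, with blocks assigned to PEs in row major order.
    """
    side = 1 << (n // 2)          # PEs per row of the logical PE array
    block = 1 << (k - n // 2)     # pixels per side of one PE's block
    pe = (row // block) * side + (col // block)
    return pe, row % block, col % block
```

With the defaults (n = 10, k = 9) this is the 1024-PE, 512 X 512 case, so `locate(0, 16)` lands on PE 1 at local position (0, 0).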




VII.3. Image Processing: A Smoothing Algorithm

One goal of any smoothing algorithm is to suppress noise in a
picture. The algorithm of this section does this by averaging the noise
over all 8 pixels in a 3 X 3 pixel square neighborhood about pixel
h(i,j).

Problem statement: Smooth an image, h, of size 512 X 512 pixels by
replacing h(i,j) by the average grey level of its 8 nearest neighbors,
h(i,j+1), h(i,j-1), h(i-1,j), h(i+1,j), h(i-1,j-1), h(i-1,j+1),
h(i+1,j+1), and h(i+1,j-1).

A general algorithm to perform the smoothing on pixel h(i,j) to
yield pixel hs(i,j) is:

    for i = 0 step +1 until 15 do
        for j = 0 step +1 until 15 do
            hs(i,j) = 1/8 * ( h(i+1,j) + h(i-1,j) + h(i,j+1)
                            + h(i,j-1) + h(i+1,j-1) + h(i+1,j+1)
                            + h(i-1,j+1) + h(i-1,j-1) ).

The approach to this algorithm is to perform 1024 16 X 16 pixel
evaluations in parallel, rather than one 512 X 512 evaluation as in the
sequential algorithm.
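The per-PE calculation can be sketched for a single block as follows. This is a simplified illustration, not the report's algorithm: it assumes a PE with no neighbors on any side, so every boundary value is copied from the block's own edge, and the helper names `pad_replicate` and `smooth` are hypothetical.

```python
def pad_replicate(block):
    """Surround a square block with a one-pixel border copied from its edge."""
    rows = [[row[0]] + list(row) + [row[-1]] for row in block]
    return [rows[0][:]] + rows + [rows[-1][:]]


def smooth(padded):
    """Replace each interior pixel by the average of its 8 neighbors."""
    size = len(padded) - 2
    hs = [[0.0] * size for _ in range(size)]
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            total = sum(
                padded[i + di][j + dj]
                for di in (-1, 0, 1)
                for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)       # the pixel itself is excluded
            )
            hs[i - 1][j - 1] = total / 8
    return hs
```

In the parallel setting, the border of `padded` would instead be filled with data received from the neighboring PEs wherever such a neighbor exists.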

At the boundaries of the 16 X 16 array, data must be transmitted
between PEs in order to calculate the smoothed value, hs. For example,
h(-1,j) must be transferred from the PE 'above' the local PE, except for
PEs 0 through 31, those at the 'top edge' of the logical array of PEs.
If a PE is at the edge of the array, and has no neighboring PE at that
edge, the boundary values of h will be set to the value at the edge.
For example, PE(0) contains the first block of data, and does not
receive h(-1, 0 -> 15). Instead, it sets h(-1, 0 -> 15) to
h(0, 0 -> 15). To take these data transfers into consideration, the
following steps must be executed before the algorithm above. "Set ICN
to PE+j" sets the interconnection network (ICN) so that PE(P) sends data
to PE(P + j modulo N). "BLOCK TRANSFER := vector" indicates that the
interconnection network transfers the vector of data as a block, as fast
as the hardware can accommodate it. Recall that the data transfer
registers (DTRs) pass data between the interconnection network and a PE.
A PE loads data into its DTRin register, and the network moves the data
to the DTRout register of the destination PE. The instruction to load
data is

    DTRin := data;

and the instruction to retrieve data from the network is

    variable := DTRout;

PE address masks are used extensively in the algorithms of this chapter.
Recall from Chapter II that positive PE address masks specify the
addresses of all PEs to remain active for the instructions which follow.
Negative PE address masks specify which PEs are deactivated for the
coming instructions. No mask implies all PEs are active.
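The effect of positive and negative PE address masks can be sketched as below. The representation is an assumption made for illustration: a mask is written out as a string with one character per address bit, most significant bit first, where "0" and "1" must match and "X" matches either value. The function names are hypothetical.

```python
def matches(mask: str, address: int) -> bool:
    """True if the address agrees with every non-X position of the mask."""
    bits = format(address, "0{}b".format(len(mask)))
    return all(m in ("X", b) for m, b in zip(mask, bits))


def active_pes(mask: str, negative: bool = False):
    """List the PEs left active by a positive mask, or by a negative mask
    (which deactivates exactly the PEs that match)."""
    return [p for p in range(1 << len(mask)) if matches(mask, p) != negative]
```

For the 32 X 32 array of this chapter, the string `"00000XXXXX"` selects PEs 0 through 31 (the top edge), and the same string with `negative=True` leaves the other 992 PEs active.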

Algorithm VII.1 passes all the data in blocks before undertaking
the smoothing calculation.

    ** Move all data before calculations begin. **
    Set ICN to PE+32;
    BLOCK TRANSFER := h(15, 0 -> 15);
    ** Turn off PEs on the top edge of the array of PEs **
    ** and input all data as a block. **
    MASK [-0^5 X^5]
    h(-1, 0 -> 15) := BLOCK TRANSFER;
    MASK [0^5 X^5]
    h(-1, 0 -> 15) := h(0, 0 -> 15);
    Set ICN to PE-32;
    BLOCK TRANSFER := h(0, 0 -> 15);
    ** Turn off all PEs on the bottom edge of the array of PEs. **
    MASK [-1^5 X^5]
    h(16, 0 -> 15) := BLOCK TRANSFER;
    MASK [1^5 X^5]
    h(16, 0 -> 15) := h(15, 0 -> 15);
    Set ICN to PE+1;
    BLOCK TRANSFER := h(0 -> 15, 15);
    ** Turn off PEs on the left hand edge of the array of PEs. **
    MASK [-X^5 0^5]
    h(0 -> 15, -1) := BLOCK TRANSFER;
    MASK [X^5 0^5]
    h(0 -> 15, -1) := h(0 -> 15, 0);
    Set ICN to PE-1;
    BLOCK TRANSFER := h(0 -> 15, 0);
    ** Turn off all PEs on the right hand edge of the array of PEs. **
    MASK [-X^5 1^5]
    h(0 -> 15, 16) := BLOCK TRANSFER;
    MASK [X^5 1^5]
    h(0 -> 15, 16) := h(0 -> 15, 15);
    Set ICN to PE+32+1;
    DTRin := h(15,15);
    MASK [-0^5 X^5] OR [-X^5 0^5]
    h(-1,-1) := DTRout;
    MASK [0^5 X^5] OR [X^5 0^5]
    h(-1,-1) := h(0,0);
    Set ICN to PE+32-1;
    DTRin := h(15,0);
    MASK [-0^5 X^5] OR [-X^5 1^5]
    h(-1,16) := DTRout;
    MASK [0^5 X^5] OR [X^5 1^5]
    h(-1,16) := h(0,15);
    Set ICN to PE-32-1;
    DTRin := h(0,0);
    MASK [-1^5 X^5] OR [-X^5 1^5]
    h(16,16) := DTRout;
    MASK [1^5 X^5] OR [X^5 1^5]
    h(16,16) := h(15,15);
    Set ICN to PE-32+1;
    DTRin := h(0,15);
    MASK [-1^5 X^5] OR [-X^5 0^5]
    h(16,-1) := DTRout;
    MASK [1^5 X^5] OR [X^5 0^5]
    h(16,-1) := h(15,0);
    ** Data transfer is complete. **
    ** Do the calculation. **

Algorithm VII.1: A parallel smoothing algorithm.

The transfers of data needed for Algorithm VII.1 can be
accomplished by many kinds of interconnection networks. Seven types
will be considered for the algorithm above. They are:

a) a recirculating Cube network (Figure IV.5),

b) a combinational logic multistage Cube network (Figures III.3,
III.5, and III.6),

c) a pipelined multistage Cube network,

d) a recirculating PM2I network (Figure IV.6),

e) a combinational logic multistage PM2I network (Figure III.7),

f) a pipelined multistage PM2I network, and

g) an Illiac network (Figure II.8).

Assume that the network available is as wide as the data word, and so
can pass one data word in one pass through the network.

Recall from Chapter IV the following formulas for calculating data
transfer delay times. For a recirculating network, the time to pass one
datum is

    Trc = dr + dm + drn + dr = dm + drn + 2*dr,

where dr is the delay of a register, drn is the delay of the
combinational logic of the network, and dm is the delay of a multiplexer
used to select data input to the network from either DTRin or DTRout.
An n = log2 N stage combinational logic multistage network has delay

    Tm = dr + dms*n + dr = dms*n + 2*dr,

where dms is the delay of one stage of combinational logic of the
network. A k-stage pipelined multistage network has the following delay
to pass S segments of data:

    Tk = dr + (k + S - 1)*( (n/k)*dms + dr ).
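The three delay formulas can be written as small functions to make the later totals easy to check. This is a sketch only; the function names are hypothetical, the arguments follow the symbols defined above, and all delays are in the same arbitrary time unit.

```python
def t_recirculating(dm: float, drn: float, dr: float) -> float:
    """Trc: time for a recirculating network to pass one datum."""
    return dm + drn + 2 * dr


def t_multistage(dms: float, dr: float, n: int) -> float:
    """Tm: time for an n-stage combinational logic multistage network."""
    return dms * n + 2 * dr


def t_pipelined(dms: float, dr: float, n: int, k: int, segments: int) -> float:
    """Tk: time for a k-stage pipelined network to pass S segments."""
    return dr + (k + segments - 1) * ((n / k) * dms + dr)
```

For example, with every delay set to one unit, n = 10, and k = 5, a 16-segment block costs `t_pipelined(1, 1, 10, 5, 16)` = 61 units, against 12 units per datum for the unpipelined multistage network.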

The number of passes through a recirculating Cube network that
executes the interconnection functions needed for the parallel smoothing
algorithm is given by Table VII.1. Recall that sending data from PE(P)
to PE(P + 2^i modulo N) is equivalent to saying that the interconnection
function maps address P to address P + 2^i modulo N. That is, 2^i is
added, modulo N, to P.

    function               number of passes
    --------               ----------------
    PE+32                  maximum 5 passes for all PEs, Cube_5 through
                           Cube_9, times 16 data items: total 80 passes.
    PE-32                  similar to PE+32: total 80 passes.
    PE+1                   maximum 5 passes for all PEs, Cube_0 through
                           Cube_4 (some PEs are masked off), times 16 data
                           items: total 80 passes.
    PE-1                   similar to PE+1: total 80 passes.
    PE+32+1, PE+32-1,      at most 10 passes for all PEs, for each
    PE-32-1, and PE-32+1   interconnection function, times 4 data items:
                           total 40 passes.

Table VII.1: The recirculating Cube network requires the given number
of passes to execute the interconnection functions needed for a parallel
smoothing algorithm.




Assume all PEs know their binary PE address P = p_(n-1) ... p_1 p_0. An
algorithm for the binary addition of 2^i to P is as follows.

    p_i := NOT p_i;
    if p_i = 0 then Cin := 1 else Cin := 0;
    for j = i+1 step +1 until n-1
        Cout := p_j AND Cin;
        p_j := p_j XOR Cin;
        Cin := Cout;

A one is added in bit p_i, thus complementing p_i. Any resultant carry
bit, Cin or Cout, equal to one ripples through the remaining address
bits, changing p_j = 1 to p_j = 0, until the ripple is stopped by some
p_j = 0, i < j < n.

The recirculating Cube network simulates this addition as follows
[SIE77b].

    for j = i step +1 until n-1
        MASK [X^(n-j) 0^(j-i) X^i]
        Cube_j;

The command "Set ICN to PE+32" with a recirculating Cube network
executes this algorithm for i = 5.
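The masked sequence of Cube functions can be simulated to confirm that it realizes the uniform shift P -> P + 2^i modulo N. The sketch below rests on two assumptions: Cube_j maps address P to P with bit j complemented, and at step j only the PEs whose address bits i through j-1 are all 0 are active (these are the PEs holding data whose carry has reached bit j). The function name is hypothetical.

```python
def shift_by_power_of_two(n: int, i: int):
    """Simulate adding 2**i to every address with masked Cube passes.

    Returns (data, passes), where data[p] is the original address of the
    item that ends up in PE p.
    """
    size = 1 << n
    data = list(range(size))            # data[p] = origin of the item in PE p
    passes = 0
    for j in range(i, n):
        window = ((1 << j) - 1) ^ ((1 << i) - 1)   # address bits i .. j-1
        moved = list(data)
        for p in range(size):
            if (p & window) == 0:       # active PE: bits i .. j-1 are all 0
                moved[p ^ (1 << j)] = data[p]      # Cube_j complements bit j
        data = moved
        passes += 1
    return data, passes
```

Active PEs always come in pairs that exchange data, so no item is overwritten. With n = 10 and i = 5 the simulation uses 5 passes, Cube_5 through Cube_9.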

The mapping from P to P - 2^i modulo N is similar. The borrow, Bin
or Bout, propagates beginning at p_i, changing p_j = 0 to p_j = 1, until
it is stopped by p_j = 1, i < j < n. A subtraction algorithm is

[...]

    for i = 0 step +1 until 4
        MASK [X^(10-i) 0^i]
        Cube_i;

The algorithm for PE-1 is similar, where the maximum length borrow chain
occurs for PE(X^5 10000).

The "PE+33" connection combines the algorithms for P+32 and P+1.
The recirculating Cube uses the following algorithm to accomplish "Set
ICN to PE+33."

    ** Add 1 to all addresses. **
    for j = 0 step +1 until 4
        MASK [X^(10-j) 0^j]
        Cube_j;
    ** At this point, only PEs (X^5 0^5) will **
    ** have a carry out from the first addition. **
    ** They hold the data which began in PEs X^5 1^5. **
    ** For them, p_5 will not change, because **
    ** 1 (the carry out) + 1 (the addition of 32) + p_5 = **
    ** 1 (the carry out of p_5), p_5 (the sum bit). **
    MASK [-X^(n-5) 0^5]
    Cube_5;
    ** Add the carries resulting from adding 32 to the addresses. **
    for j = 6 step +1 until n-1
        MASK [X^(n-j) 0^(j-5) X^5] OR [X^(n-j) 0^(j-6) X 0^5]
        Cube_j;

The PE-33 connection executes a similar algorithm.




The recirculating Cube uses the following algorithm to execute "Set
ICN to PE+31." Adding 31 to an address P adds 1 in all five least
significant bit positions. Thus, address bits p_0, p_1, p_2, p_3, and
p_4 are all complemented by the addition. The only PE addresses that do
not have a carry out of bit position p_4 are the initial addresses
X^5 0^5. After step five of the addition algorithm, data from these PEs
resides in PE X^5 1^5. Since these addresses now have the data intended
for them, they are masked off for the remainder of the algorithm. The
last five steps of the algorithm continue an addition.

    for i = 0 step +1 until 4
        MASK [X^(n-i) 1^i]
        Cube_i;
    ** Data that began in PEs (X^5 0^5) are now in PEs (X^5 1^5). **
    ** For only this set of PEs, there is no **
    ** carry out of bit p_4. So, they do not **
    ** participate in the last five steps of **
    ** the algorithm. **
    ** The last five steps add only the carry out **
    ** of bit p_4 to the remainder of the address. **
    for j = 5 step +1 until n-1
        MASK [X^(n-j) 0^(j-5) X^5] AND [-X^5 1^5]
        Cube_j;

The command "Set ICN to PE-31" executes a similar algorithm.
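The two phases of the PE+31 algorithm can be checked with a short simulation. As a labeled assumption, Cube_j is modeled as complementing address bit j; the first loop activates PEs whose low i bits are all 1, and the second loop activates PEs whose bits 5 through j-1 are all 0 and whose five low bits are not all 1. The function name `shift_by_31` is hypothetical.

```python
def shift_by_31(n: int = 10):
    """Simulate "Set ICN to PE+31" on a recirculating Cube network.

    Returns (data, passes), where data[p] is the original address of the
    item that ends up in PE p.
    """
    size = 1 << n
    data = list(range(size))
    passes = 0

    def cube(j, active):
        """One pass: every active PE sends its item across Cube_j."""
        nonlocal data, passes
        moved = list(data)
        for p in range(size):
            if active(p):
                moved[p ^ (1 << j)] = data[p]
        data = moved
        passes += 1

    # First five steps: mask selects PEs whose low i bits are all 1.
    for i in range(5):
        ones = (1 << i) - 1
        cube(i, lambda p, ones=ones: (p & ones) == ones)

    # Last steps: add the carry out of bit 4, skipping PEs X^5 1^5.
    for j in range(5, n):
        window = ((1 << j) - 1) ^ 31            # address bits 5 .. j-1
        cube(j, lambda p, w=window: (p & w) == 0 and (p & 31) != 31)

    return data, passes
```

For n = 10 the simulation takes 10 passes, matching the "at most 10 passes" entry of Table VII.1 for the diagonal transfers.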

A total time of 160(dm + drn + 2*dr) + 160(dm + drn + 2*dr) + 40(dm
+ drn + 2*dr) = 360(dm + drn + 2*dr) is required for the recirculating
Cube network to pass the data. From [SIE79b], the number of passes
shown here represents a lower time bound for the Cube interconnection
functions to accomplish the data transfers.

A multistage Cube network can execute any shift permutation, i.e.,
a mapping from PE(x) to PE(x + j modulo N), in one pass through the
network. Theorem 1 of [LAW73] showed this result for the omega network.
In [PEA77], this result is shown for the indirect binary n-cube network.
The network can form any uniform shift permutation. It is shown that
repeated applications of the unit shift permutation can be done instead
in one pass through the network. A combinational logic Cube network
would require the number of passes indicated in Table VII.2. A total
time of 32(dms*10 + 2*dr) + 32(dms*10 + 2*dr) + 4(dms*10 + 2*dr) =
68(dms*10 + 2*dr) is needed to pass the data items.

    function               number of passes
    --------               ----------------
    PE+32, PE-32,          one pass times 16 data items:
    PE+1, and PE-1         total 64 passes.
    PE+32+1, PE+32-1,      one pass for each times 4 data items:
    PE-32+1, and PE-32-1   total 4 passes.

Table VII.2: A combinational logic Cube network requires the number of
passes listed to accomplish the interconnection functions for a parallel
smoothing algorithm.

Let the network be pipelined into 5 pieces, where each piece is two
stages of the multistage network followed by a register. A 5-stage
pipelined multistage Cube network would require the time indicated in
Table VII.3. A total time of 4(dr + 20(dms*2 + dr)) + 4(dr + 5(dms*2 +
dr)) = 8*dr + 100(dms*2 + dr) is required for the pipelined multistage
Cube network to transfer the data.

    function               time
    --------               ----
    PE+32, PE-32,          dr + 5(dms*2 + dr) + 15(dms*2 + dr)
    PE+1, and PE-1         for each: total 4(dr + 20(dms*2 + dr)).
    PE+32+1, PE+32-1,      dr + 5(dms*2 + dr) for each:
    PE-32+1, and PE-32-1   total 4(dr + 5(dms*2 + dr)).

Table VII.3: A 5-stage pipelined multistage Cube network passes data
for a parallel smoothing algorithm in the times shown above.

A recirculating PM2I network passes the data according to Table
VII.4. The total time for the recirculating PM2I network to pass the
data is (64 + 8)*(dm + drn + 2*dr) = 72(dm + drn + 2*dr).

    function               number of passes
    --------               ----------------
    PE+32, PE-32,          one pass for each datum:
    PE+1, and PE-1         total 16(4) = 64 passes.
    PE+32+1, PE+32-1,      two passes for each datum:
    PE-32+1, and PE-32-1   total 4(2) = 8 passes.

Table VII.4: A recirculating PM2I network passes data for a parallel
smoothing algorithm as shown above.

The multistage PM2I network can execute any PE+j transfer in one
pass through the network, -1024 < j < 1024, i.e., PE±1, PE±31, PE±32,
and PE±33. Thus, any one of the transfers for this algorithm requires
one use of the network to reach its destination. The time for the
combinational logic multistage PM2I network to transfer the data equals
that for the combinational logic multistage Cube network to pass the
data. Table VII.5 indicates the data transfer times for a 5-stage
pipelined PM2I network. To transfer all the boundary values, the
pipelined network requires time Tk = 4(dr + 5(dms*2 + dr) + 15(dms*2 +
dr)) + 4(dr + 5(dms*2 + dr)) = 8*dr + 100(dms*2 + dr).

    function               time
    --------               ----
    PE+32, PE-32,          dr + 5(dms*2 + dr) + 15(dms*2 + dr)
    PE+1, and PE-1         for each: total 4(dr + 20(dms*2 + dr)).
    PE+32+1, PE+32-1,      dr + 5(dms*2 + dr) for each:
    PE-32+1, and PE-32-1   total 4(dr + 5(dms*2 + dr)).

Table VII.5: A 5-stage pipelined multistage PM2I network passes data
for a parallel smoothing algorithm according to the table above.

Recall from Chapter II that the Illiac network is a recirculating
network, where the interconnection functions are

    Illiac_+1(P) = PM2_+0(P)
    Illiac_-1(P) = PM2_-0(P)
    Illiac_+m(P) = PM2_+(n/2)(P)
    Illiac_-m(P) = PM2_-(n/2)(P)

where m = sqrt(N) and n = log2 N. For this algorithm, m = 32, and so the
data transfer time for the Illiac network is the same as that for a
recirculating PM2I network.

Compare the data transfer times for the pipelined multistage Cube
and PM2I networks and the combinational logic multistage Cube and PM2I
networks. If dms = dr = D, then:

    Tk = 4(61 D) + 64 D = 308 D, and
    Tm = 68(12 D) = 816 D.

For this example, the percent difference between the time for the
pipelined network to transfer data and the unpipelined network to
transfer data is ( (816 - 308)/816 ) = 62%.

Compare the data transfer time for the 5-stage pipelined PM2I
network and the recirculating PM2I network. The pipelined network
requires time Tk = 308 D to transfer the data, and the recirculating
network requires time Trc = 288 D to transfer the data. The pipelined
network is 6.9% slower than the recirculating network, and is more
costly. In this case, the pipelined PM2I network is less cost effective
than the recirculating PM2I network.

In summary, if D = dms = dr = drn = dm, then Table VII.6 shows the
delay to pass the data for each network considered. The times will vary
according to the size of the picture stored in each PE, since this
affects the size of any block transfer. Figure VII.1 shows this
variance.
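The totals quoted for Algorithm VII.1 can be recomputed from the pass counts and delay formulas of this section. The sketch below assumes every elementary delay equals D and uses the pass counts stated in the text (360 for the recirculating Cube, 72 for the recirculating PM2I and Illiac networks, 68 single-word passes for the unpipelined multistage networks). The function name and dictionary keys are hypothetical.

```python
def smoothing_transfer_times(D: float = 1.0, n: int = 10, k: int = 5,
                             block: int = 16):
    """Total data transfer time per network for the block transfer algorithm."""
    recirc_pass = D + D + 2 * D              # dm + drn + 2*dr
    multistage_pass = n * D + 2 * D          # dms*n + 2*dr
    stage = (n / k) * D + D                  # (n/k)*dms + dr

    def pipelined(segments):
        return D + (k + segments - 1) * stage

    # Four edge transfers of `block` words, four corner transfers of one word.
    pipelined_total = 4 * pipelined(block) + 4 * pipelined(1)
    multistage_total = (4 * block + 4) * multistage_pass
    return {
        "recirculating Cube": 360 * recirc_pass,
        "combinational logic multistage Cube": multistage_total,
        "5-stage pipelined Cube": pipelined_total,
        "recirculating PM2I": 72 * recirc_pass,
        "combinational logic multistage PM2I": multistage_total,
        "5-stage pipelined PM2I": pipelined_total,
        "Illiac": 72 * recirc_pass,
    }
```

With D = 1 the results are 1440, 816, 308, and 288 time units, which reproduces the 62% and 6.9% comparisons made above.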

    network                                  time
    -------                                  ----
    recirculating Cube                       360(dm + drn + 2*dr) = 1440 D
    combinational logic multistage Cube      68(dms*10 + 2*dr) = 816 D
    5-stage pipelined Cube                   8*dr + 100(dms*2 + dr) = 308 D
    recirculating PM2I                       72(dm + drn + 2*dr) = 288 D
    combinational logic multistage PM2I      68(dms*10 + 2*dr) = 816 D
    5-stage pipelined PM2I                   8*dr + 100(dms*2 + dr) = 308 D
    Illiac                                   72(dm + drn + 2*dr) = 288 D

Table VII.6: Summary of data transfer times for different networks for
a parallel smoothing algorithm.

The results of Algorithm VII.1 showed that when data is transferred
in blocks, some time advantage is gained when using a pipelined
multistage interconnection network. An algorithm can be designed that
does not transfer data in large blocks, but rather transfers data one
word at a time as it is required in the calculation. Consider the
boundary points. First, hs for row 0 of the matrix H is calculated by
transferring the data needed for one point, and then calculating hs for
that pixel. Next, the same is done for row 15, column 0 and column 15.
Finally, the calculation is completed by determining hs for those points
not at the edge of H, that is, those that require no data transfers.
Algorithm VII.2 illustrates this approach.

    ** CALCULATE hs FOR ROW 0 **
    ** Move the necessary data into the PE from all adjacent PEs. **
    Set ICN to PE+1
    DTRin := h(0,15);
    ** Turn off PEs on the left edge of the array of PEs. **
    ** They have no PE to the left, and so should not receive **
    ** any data. **
    MASK [-X^5 0^5]
    h(0,-1) := DTRout;
    MASK [X^5 0^5]
    h(0,-1) := h(0,0);
    DTRin := h(1,15);
    MASK [-X^5 0^5]
    h(1,-1) := DTRout;
    MASK [X^5 0^5]
    h(1,-1) := h(1,0);
    Set ICN to PE+33
    DTRin := h(15,15);
    ** Turn off PEs on the top row and the left edge **
    ** of the array of PEs. They have no neighboring PEs **
    ** from which to receive data. **
    MASK [-0^5 X^5] OR [-X^5 0^5]
    h(-1,-1) := DTRout;
    MASK [0^5 X^5] OR [X^5 0^5]
    h(-1,-1) := h(0,0);
    Set ICN to PE+32
    DTRin := h(15,0);
    ** Turn off PEs on the top row of the array of PEs. **
    MASK [-0^5 X^5]
    h(-1,0) := DTRout;
    MASK [0^5 X^5]
    h(-1,0) := h(0,0);
    ** CALCULATE ROW 0 **
    for j = 0 to 14 by +1 do
        Set ICN to PE+32
        DTRin := h(15,j+1);
        MASK [-0^5 X^5]
        h(-1,j+1) := DTRout;
        MASK [0^5 X^5]
        h(-1,j+1) := h(0,j+1);
        hs(0,j) := 1/8 * (h(1,j) + h(-1,j) + h(0,j+1) + h(0,j-1)
                   + h(1,j-1) + h(1,j+1) + h(-1,j+1) + h(-1,j-1) );
    Set ICN to PE+32-1
    DTRin := h(15,0);
    ** Turn off PEs in the top row and right hand edge **
    ** of the array of PEs. **
    MASK [-0^5 X^5] OR [-X^5 1^5]
    h(-1,16) := DTRout;
    MASK [0^5 X^5] OR [X^5 1^5]
    h(-1,16) := h(0,15);
    Set ICN to PE-1;
    DTRin := h(0,0);
    ** Turn off PEs on the right hand edge of array of PEs. **
    MASK [-X^5 1^5]
    h(0,16) := DTRout;
    MASK [X^5 1^5]
    h(0,16) := h(0,15);
    DTRin := h(1,0);
    MASK [-X^5 1^5]
    h(1,16) := DTRout;
    MASK [X^5 1^5]
    h(1,16) := h(1,15);
    hs(0,15) := 1/8 * (h(1,15) + h(-1,15) + h(0,16) + h(0,14)
                + h(1,14) + h(1,16) + h(-1,16) + h(-1,14))
    ** ROW 15 IS DONE IN AN ANALOGOUS MANNER **
    ** TOTAL NUMBER OF DATA TRANSFERS FOR ROW 0 = 22 **
    ** For column 0, assume that h(14,-1) and h(15,-1) **
    ** have been transferred for calculation for row 15 **
    for i = 1 step +1 until 12 do
        Set ICN to PE+1
        DTRin := h(i+1,15);
        MASK [-X^5 0^5]
        h(i+1,-1) := DTRout;
        MASK [X^5 0^5]
        h(i+1,-1) := h(i+1,0);
        hs(i,0) := 1/8 * (h(i+1,0) + h(i-1,0) + h(i,1) + h(i,-1)
                   + h(i+1,-1) + h(i+1,1) + h(i-1,1) + h(i-1,-1))
    hs(13,0) := 1/8 * (h(14,0) + h(12,0) + h(13,1) + h(13,-1)
                + h(14,-1) + h(14,1) + h(12,1) + h(12,-1) )
    hs(14,0) := 1/8 * (h(15,0) + h(13,0) + h(14,1) + h(14,-1)
                + h(15,-1) + h(15,1) + h(13,1) + h(13,-1))
    ** COLUMN 15 IS DONE IN THE SAME MANNER **
    ** NUMBER OF DATA TRANSFERS FOR COLUMN 0 = 12 **
    ** CALCULATE hs FOR THE REMAINDER OF H **
    for i = 1 step +1 until 14 do
        for j = 1 to 14 by +1 do
            hs(i,j) := 1/8 * (h(i+1,j) + h(i-1,j) + h(i,j+1) + h(i,j-1)
                       + h(i+1,j-1) + h(i+1,j+1) + h(i-1,j+1) + h(i-1,j-1) )

Algorithm VII.2: A parallel smoothing algorithm that does not transfer
data in blocks.

The total number of transfers for this algorithm is
22 + 22 + 12 + 12 = 68. The time for a combinational logic multistage
network to execute these transfers is Tm = 68 (Sm (10*dms + 2*dr)) =
68 (12 Sm D), where Sm is the number of segments per data word
transmitted by the network, and D = dms = dr. The time to transfer the
data for a k-stage pipelined network is Tk = 68 (dr + (k + Sk - 1)((n/k)
* dms + dr)) = 68 D (1 + (k + Sk - 1)(n/k + 1)), where Sk is the number
of segments per data word transmitted by the network. For example, if
Sk = 1 and k = 5, then Tk = 1088 D. The times for the recirculating and
combinational logic multistage networks to do the transfers are the same
as the first algorithm presented. The pipelined version, however,
differs, since the advantages seen by block transfer of data are not
found in the second algorithm.
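The word-at-a-time totals can be checked directly from the two expressions above. This is a minimal sketch assuming dms = dr = D; the function names are hypothetical, and the defaults (68 transfers, n = 10) are those of this example.

```python
def word_at_a_time_multistage(segments: int, transfers: int = 68,
                              n: int = 10, D: float = 1.0) -> float:
    """Tm = transfers * Sm * (n*dms + 2*dr) with dms = dr = D."""
    return transfers * segments * (n * D + 2 * D)


def word_at_a_time_pipelined(k: int, segments: int, transfers: int = 68,
                             n: int = 10, D: float = 1.0) -> float:
    """Tk = transfers * D * (1 + (k + Sk - 1)(n/k + 1)) with dms = dr = D."""
    return transfers * D * (1 + (k + segments - 1) * (n / k + 1))
```

With one segment per word and k = 5, the pipelined network takes 1088 D, compared with 308 D when the same boundary data is moved as blocks.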

In summary, a parallel algorithm for smoothing a 512 X 512 image
requires O(16^2) = O(256) steps to process the image plus any extra time
needed to transfer data to the required PEs. If it is assumed that a
recirculating network requires one computational time unit to pass one
data item, then the data transfer times for the recirculating Cube,
PM2I, and Illiac networks are not negligible compared to the time to
perform the smoothing operation. For the pipelined and combinational
logic multistage networks, the transfer of 68 data items is not
negligible compared to the time for the remainder of the algorithm,
although the pipelined network algorithm will fare better than the
combinational logic multistage network using block transfers. For this
algorithm, the best choice of interconnection network would be a
recirculating PM2I or an Illiac network.




VII.4. Image Processing: Thresholding

Segmentation of a picture extracts the subpictures of the image for
further study. If a subpicture contains a range of grey levels
different from the rest of the picture (the background), then a
histogram of the grey level distribution of the pixels will show a peak
which corresponds to the grey levels of the subpicture. The subpictures
are extracted from the image by selecting threshold values of the grey
levels which are the values at the bottoms of the valleys on either side
of the peak of the histogram. This method is called threshold detection
by mode method [ROK76, CDL77, CDL78]. A parallel histogram formation
algorithm is defined below.

The calculations for this task are based on a technique called
recursive doubling [ST75]. A recursive doubling algorithm is a type of
parallel algorithm which, using N PEs, combines N items in O( log2 N )
steps. As an example, consider forming the sum of N numbers, D(0), D(1),
..., D(N-1). Let PE(i) store D(i), 0 ≤ i < N, in its memory. The
first step of the algorithm forms N/2 intermediate sums in parallel.
Each sum is D(i) + D(i+1), for i even, 0 ≤ i < N, and is stored in
PE(i). For the first step of the algorithm, then, the PEs execute


** All odd numbered PEs pass DATA, D(i), for i odd. **
MASK [X^(n-1) 1]
    SET ICN to PE-1;
    DTRin := DATA;
** All even numbered PEs add the incoming data to the DATA, **
** D(i), for i even, that is stored in the memory of PE(i). **
MASK [X^(n-1) 0]
    SUM := DATA + DTRout;

After each step, the number of intermediate sums is halved. The
last step sums two intermediate sums to give the final sum of N numbers.
A general recursive doubling algorithm to sum N numbers using N PEs with
the final sum in PE(0) is described below.

sum := D(i), for all PEs i, 0 <= i < N.
for i = 0 step +1 until n - 1
    MASK [X^(n-i-1) 1 0^i]
        set ICN to PE-(2^i);
        DTRin := sum;
    MASK [X^(n-i-1) 0^(i+1)]
        sum := sum + DTRout;

Figure  VII. 2  illustrates  the  recursive  doubling  process  for  N  =  8  items 
using  8  PEs. 
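
The recursive doubling sum can be simulated sequentially to check the step count. The following is a minimal sketch, not part of the original report: the function name and the list-based model of the PEs are assumptions made for illustration, and N is assumed to be a power of two.

```python
def recursive_doubling_sum(data):
    """Simulate summing N = 2**n items, one per PE, in n transfer-and-add steps."""
    N = len(data)
    n = N.bit_length() - 1
    assert N == 1 << n, "N must be a power of two"
    sums = list(data)                 # sums[p] models the variable `sum` in PE(p)
    for i in range(n):
        stride = 1 << i
        dtr_out = [None] * N          # dtr_out[p] models DTRout of PE(p)
        # Senders: PEs whose low i address bits are 0 and whose bit i is 1
        # pass their partial sum to PE - 2^i.
        for p in range(N):
            if p % (2 * stride) == stride:
                dtr_out[p - stride] = sums[p]
        # Receivers: PEs whose low i+1 address bits are 0 add the incoming value.
        for p in range(0, N, 2 * stride):
            sums[p] += dtr_out[p]
    return sums[0], n


total, steps = recursive_doubling_sum([3, 1, 4, 1, 5, 9, 2, 6])
```

For N = 8 the simulation uses three steps, in agreement with the count of log2 N steps stated above.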

Problem statement: The image h is distributed through 1024 PEs, as
described above. Each PE has calculated a 128 bin histogram, HIST, for
its piece of h. Merge the 1024 pieces of HIST into one histogram
residing in PE(0). The histogram is now ready to determine a threshold




Figure VII.2: A recursive doubling algorithm sums 8 data items in 3
steps, where each step is a data transfer among PEs followed by an
addition.




value  for  h. 

For N = 8, the algorithm may be represented as in Figure VII.3.
Until the last step, the histogram is passed as two independent halves.
Two simultaneous recursive doubling algorithms sum the two halves, and
the results reside in PEs 0 and 1. The last step merges the two halves
into the final histogram of 128 bins. Each step of the algorithm is a
data transfer followed by an addition. After step 1, even numbered PEs
hold all of the first half of HIST (i.e., PEs 0, 2, 4, and 6 hold
HIST(0 -> 63)), and odd numbered PEs hold all of the second half (i.e.,
PEs 1, 3, 5, and 7 hold HIST(64 -> 127)). After step 2, PEs 0 and 4 hold
all of the first half of HIST for PEs 0-3 and 4-7, respectively. PEs 1
and 5 hold the second half of HIST. After step 3, PE(0) holds all of the
first half of HIST, and PE 1 holds all of the second half of HIST. The
final step merges the two halves in PE(0). Thus, N pieces of the
histogram, HIST, are summed to PE(0) in ((log N) + 1) * (# bins)/2 data
transfers and ((log N) + 1) * (# bins)/2 additions, where, here,
(# bins) = 128.
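
The two-chain merge described above can be checked with a small sequential simulation. This is a sketch under stated assumptions rather than the report's implementation: the function name, the snapshot lists used to model simultaneous transfers, and the test data are all hypothetical, and N is assumed to be a power of two with N >= 2.

```python
def merge_histograms(hists):
    """hists[p] is the local histogram of PE(p).

    Returns (histogram left in PE(0), data items sent by each PE).
    """
    N, bins = len(hists), len(hists[0])
    n = N.bit_length() - 1
    half = bins // 2
    h = [list(x) for x in hists]
    sent = 0
    # Step 1: Cube_0 exchange. Even PEs keep the low half, odd PEs the high half.
    snapshot = [list(x) for x in h]
    for p in range(N):
        partner = p ^ 1
        lo = (p & 1) * half                      # start of the half kept by PE(p)
        for b in range(lo, lo + half):
            h[p][b] = snapshot[p][b] + snapshot[partner][b]
    sent += half
    # Steps 2..n: PE-(2^(i-1)). Two independent recursive doubling chains.
    for i in range(2, n + 1):
        stride = 1 << (i - 1)
        snapshot = [list(x) for x in h]
        for p in range(N):
            low_bits = p % (1 << i)
            if low_bits in (0, 1):               # receivers of the low / high chain
                lo = low_bits * half
                for b in range(lo, lo + half):
                    h[p][b] = snapshot[p][b] + snapshot[p + stride][b]
        sent += half
    # Final step: PE(1) passes the high half to PE(0) over PE-1.
    h[0][half:] = h[1][half:]
    sent += half
    return h[0], sent


hists = [[(p + b) % 5 for b in range(128)] for p in range(8)]
merged, sent = merge_histograms(hists)
```

With N = 8 and 128 bins, each PE sends (log N + 1) * 64 = 256 items, which matches the transfer count given in the text.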

For N = 1024, Algorithm VII.3 calculates the histogram using the
technique above.

A sequential algorithm for calculating HIST requires 512 * 512 =
262,144 steps. The parallel algorithm described above uses 16 * 16 =
256 steps for each PE to calculate its local histogram. It uses
(log2 N + 1) * 128 steps to merge the histogram into PE(0), where
log2 N = n = 10. Each step is a data transfer or an addition. So, the
parallel algorithm uses 256 + 704 (additions) + 704 (transfers) = 1664
steps to give the same histogram as the sequential algorithm. This
assumes that the time to add


Figure VII.3: A modified recursive doubling algorithm for histogram
formation. Let Ai be the HIST(0 -> 63) data from PE(i). Let Bi be the
HIST(64 -> 127) data from PE(i).


** Step 1: Exchange between PE i and PE i+1, i even. **
INDEX := (NOT p0) * 64;
SET ICN to Cube_0;
DTRin := HIST(INDEX -> INDEX + 63);
INDEX2 := p0 * 64;
HIST(INDEX2 -> INDEX2 + 63) :=
    DTRout + HIST(INDEX2 -> INDEX2 + 63);

** Log N - 1 steps yield HIST(0 -> 63) in PE(0), **
** HIST(64 -> 127) in PE 1. **
for i = 2 step +1 until (log N) do
    INDEX := p0 * 64;
    SET ICN to PE-(2^(i-1));
    DTRin := HIST(INDEX -> INDEX + 63);
    MASK [X^(n-i) 0^i]
        HIST(0 -> 63) := HIST(0 -> 63) + DTRout;
    MASK [X^(n-i) 0^(i-1) 1]
        HIST(64 -> 127) := HIST(64 -> 127) + DTRout;

** Merge the two halves in PE(0). **
SET ICN to PE-1;
DTRin := HIST(64 -> 127);
MASK [0^n]
    HIST(64 -> 127) := DTRout;


Algorithm VII.3: A parallel histogram formation algorithm.




is equal to the data transfer time. Depending on the network
implementation, the calculation time difference between the serial and
parallel algorithms is almost two orders of magnitude. Note that, in
contrast to Algorithm VII.1, the data transfer times for the networks
discussed are independent of the number of pixels stored in one PE.
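
The step counts quoted above follow from simple arithmetic, which the short sketch below reproduces. The function and parameter names are illustrative assumptions; the sketch also assumes, as the text does, that an addition and a data transfer take the same time.

```python
import math


def histogram_step_counts(image_side=512, pes=1024, bins=128):
    """Return (serial steps, parallel steps) for histogram formation."""
    pixels_per_pe = image_side * image_side // pes          # 16 * 16 = 256
    merge_steps = (int(math.log2(pes)) + 1) * (bins // 2)   # 11 * 64 = 704
    serial = image_side * image_side
    # Local histogram, then one addition and one transfer per merge step.
    parallel = pixels_per_pe + 2 * merge_steps
    return serial, parallel


serial, parallel = histogram_step_counts()
speedup = serial / parallel
```

Under these assumptions the ratio of serial to parallel steps is about 157.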

Performing a data transfer time analysis similar to that of section
5 yields the results given in Table VII.7. The transfers of one step
of the recursive doubling algorithm can be done using one Cube or PM2I
interconnection function. Hence, the data transfer times for the
recirculating Cube and recirculating PM2I networks are less than the
times for the multistage Cube and PM2I networks. Because data is
transferred in blocks, an advantage is realized by using a pipelined
multistage network.

The data transfer time for the Illiac network indicates that the
Illiac interconnection functions are not adequate to support a recursive
doubling algorithm efficiently. The "PE -1" connection requires only
one pass through the network per data item. However, the "PE -16" and
"PE -512" connections each require 16 passes through the network for
each data item. The total transfer time becomes

64 (1 + 2 + 4 + 8 + 16 + 1 + 2 + 4 + 8 + 16 + 1) (dm + drn + 2 dr)

= 64 (63) (4 D) = 16,128 D.
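
The pass counts in this sum can be reproduced by a breadth-first search over the four Illiac moves. The sketch below is illustrative only: it assumes the Illiac functions are shifts of +1, -1, +sqrt(N), and -sqrt(N) modulo N, and, following the count used above, it models every step of the algorithm as a uniform shift "PE - s". The function name is hypothetical.

```python
import math
from collections import deque


def min_illiac_passes(shift, N=1024):
    """Fewest Illiac moves (mod N) that realize the transfer PE -> PE - shift."""
    root = math.isqrt(N)
    target = (-shift) % N
    dist = {0: 0}
    queue = deque([0])
    while queue:
        x = queue.popleft()
        if x == target:
            return dist[x]
        for step in (1, -1, root, -root):
            y = (x + step) % N
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)


# Shifts used by the algorithm for N = 1024: 1, 2, ..., 512, then the final PE-1 merge.
shifts = [2 ** i for i in range(10)] + [1]
passes = [min_illiac_passes(s) for s in shifts]
total_time_in_D = 64 * sum(passes) * 4     # 64 items per step, 4 D per pass
```

The search gives 16 passes for both the shift of 16 and the shift of 512, and the total agrees with the 16,128 D derived above.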

For the histogram formation algorithm, the Illiac network is not a
good choice for an interconnection network. A pipelined multistage
network will afford the fastest algorithm execution time. A
recirculating network is the next best choice, and a combinational logic
multistage network the third best choice.



Table VII.7: Data transfer times for various networks for a parallel
histogram formation algorithm.


network                                         time

recirculating Cube                              2816 D
combinational logic multistage Cube             8448 D
five stage pipelined multistage Cube            2555 D
recirculating PM2I                              2816 D
combinational logic multistage PM2I             8448 D
five stage pipelined multistage PM2I            2555 D
Illiac                                          16,128 D




Algorithm VII.3 performs two simultaneous recursive doubling
algorithms, and after log2 N steps half of the bins of the histogram are
stored in PE(0) and half are stored in PE(1). A modification can be
made to Algorithm VII.3 which further divides the histogram such that
2^b simultaneous recursive doubling algorithms take place, and after
log2 N steps, the first 2^b PEs each have 1/(2^b) of the bins of the
histogram, 1 < b < n. For example, an algorithm designed for b = 2
stores 1/4 of the bins of the histogram in each of PEs 0, 1, 2, and 3
after log2 N steps. This algorithm reduces the number of items in each
block transfer. If the histogram has 128 bins, Algorithm VII.3 causes
each PE to transfer 64 data items at a step. A modified algorithm for
b = 2 causes each PE to transfer 32 data items at a step to give 1/4 of
the histogram in four PEs after log2 N steps. Since the interconnection
functions are the same for both algorithms, the relative performance of
all of the networks discussed is the same for both Algorithm VII.3 and
the modified algorithm.




VII.5. Image Data Classification

One method of computer assisted interpretation of images relies on
classification. The algorithm presented here is based on a
classification method which is often applied to data collected by the
Earth Resources Technology Satellite, called Landsat. Typically, ground
cover classes of interest are defined having characteristic spectral
properties. If there are M such classes, the probability of membership
in one of the M classes is computed for each pixel. The highest
computed probability determines the class assignment of the pixel.

The LARS algorithm was designed for data vectors X which consist of
four spectral observations per pixel. Let X be a d-dimensional vector
so that for this algorithm, h(i,j) refers to a vector of d spectral
observations. Such a vector in PE(P) has the name h^P(i,j).

Presented below is a parallel algorithm for image data
classification [SW78]. The clustering algorithm supplies two
parameters, the cluster mean vector and the covariance matrix, for each
of M clusters that are identified for the data. The classification
algorithm then computes the probability of membership of h(i,j) in each
of the M classes. The pixel is assigned to the class k which yields the
highest probability P_k(h(i,j)). The equation used is

P_k(h(i,j)) = b_k - 1/2 ((h(i,j) - M_k)^T C_k^{-1} (h(i,j) - M_k)),

for class k, 0 <= k <= M-1. M_k is the mean vector for class k, C_k is
the covariance matrix for class k, and b_k is a predetermined value
dependent on C_k. The superscript T denotes vector transpose.
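
The classification rule can be illustrated with a small worked example. The sketch below is not from the report: the function names and the two toy classes are hypothetical, and the inverse covariance matrix C_k^{-1} is assumed to be precomputed and supplied directly.

```python
def discriminant(x, mean, cov_inv, b):
    """P_k(x) = b_k - 1/2 (x - M_k)^T C_k^{-1} (x - M_k), with C_k^{-1} given."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    d = len(diff)
    quad = sum(diff[r] * cov_inv[r][c] * diff[c]
               for r in range(d) for c in range(d))
    return b - 0.5 * quad


def classify(x, classes):
    """classes is a list of (mean, cov_inv, b); returns the index k of the largest P_k."""
    scores = [discriminant(x, m, ci, b) for (m, ci, b) in classes]
    return max(range(len(scores)), key=scores.__getitem__)


# Two toy classes in d = 2 dimensions with identity covariance (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
classes = [([0.0, 0.0], identity, 0.0), ([4.0, 4.0], identity, 0.0)]
label_near_origin = classify([0.5, 0.2], classes)
label_far = classify([3.0, 5.0], classes)
```

Each evaluation of the quadratic form costs on the order of d^2 operations, which is the source of the d^2 factor in the time estimates for this algorithm.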

Recall that the 512 X 512 image is distributed among the 1024 PEs
such that each PE stores a 16 X 16 pixel portion of the image. Let the
machine be partitioned into partitions of size M and, for simplicity,
let M = 2^r. Within each partition, the PE addresses are designated PE
0, 1, ..., and M-1. PE(k) computes P_k for all pixels of the partition
and has stored in its memory all constants needed for that computation,
b_k, C_k, and M_k, 0 <= k < M [GR76].

The algorithm computes P_k for all pixels of the 16 X 16 portion of
the image in each PE. The first step calls on PE(i) to read h^i(0,0)
from memory, 0 <= i < M. PE(0) computes P_0 for the h^0(0,0), PE(1)
computes P_1 for the h^1(0,0), and so forth. Assume that the initial
class membership and its probability are zero. The new probability of
class membership is compared to the current assigned class probability
for h^i(0,0). If the new probability is higher than the old, h^i(0,0) is
assigned to the new class. Next, for 0 <= k < M, PE(k) sends its data to
PE(k+1 modulo M), and the process is repeated. h^i(0,0), its highest
associated probability, and the current class assignment are passed to
all M PEs until all M class probabilities have been computed and the
most probable class assigned, 0 <= i < M. Thus, the classes for M pixels
have been computed. The process is repeated for the remaining pixels.
Algorithm VII.4 describes the classification algorithm.
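
The circulation of pixels within one partition can be sketched as a sequential simulation. This is an illustration under assumptions, not the report's code: `score(k, x)` is a hypothetical stand-in for the evaluation of P_k in PE(k), and the sketch initializes the best score to negative infinity rather than zero so that toy scores that happen to be negative still compare correctly.

```python
def classify_partition(pixels, score):
    """pixels[p] is the pixel read by PE(p); score(k, x) models P_k(x) computed in PE(k)."""
    M = len(pixels)
    # Each PE holds a (pixel, best class so far, best score so far) triple.
    held = [(x, None, float("-inf")) for x in pixels]
    for _ in range(M):
        # PE(k) scores the pixel it currently holds against class k.
        updated = []
        for k, (x, cls, best) in enumerate(held):
            p = score(k, x)
            if p > best:
                cls, best = k, p
            updated.append((x, cls, best))
        # f(x) = (x + 1) modulo M: PE(k) passes its triple to PE(k+1 modulo M).
        held = [updated[(k - 1) % M] for k in range(M)]
    # After M passes every triple is back in the PE where it started.
    return [cls for (_, cls, _) in held]


# Toy scoring function: class k is centred on the value 10 * k (illustrative only).
toy_score = lambda k, x: -abs(x - 10 * k)
labels = classify_partition([31, 2, 19, 11], toy_score)
```

Each pixel visits all M PEs exactly once, so M class assignments are completed every M passes.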

The algorithm is computationally intensive, since P_k is calculated
for all 512^2 = 2^18 pixels, for M values of k. A serial computer
algorithm requires O(M * d^2 * 512^2) time periods to compute the
probability P_k for all k and for all pixels of the image. (Recall C_k
is a d X d matrix.) The algorithm above introduces parallelism to the
task in several forms. First, it distributes the image across 1024 PEs.
Next, the 1024 PEs are divided into 2^(n-r) partitions of size M = 2^r.




activate all PEs
for i = 0 step +1 until 15
    for j = 0 step +1 until 15
        class probability := 0;
        class := 0;
        current := h^P(i,j);  for PE(P), 0 <= P < M,
                              within each partition.

        ** PE(k) calculates P_k for the current pixel that it has. **
        for l = 0 step +1 until M-1
            P_k := b_k - 1/2 ((current - M_k)^T
                   * C_k^{-1} * (current - M_k));
            if (P_k > class probability) then
                class := k;
                class probability := P_k;

            ** Move the data to the next PE for the next P_k. **
            set ICN to PE +1 modulo M;
            DTRin := (current, class, class probability);
            (current, class, class probability) := DTRout.


Within each partition, M probabilities are computed at each step of the
algorithm. The parallel algorithm requires O(M * d^2 * 16^2) time
periods to compute the probabilities. The speedup of the parallel
algorithm over the serial algorithm is approximately three orders of
magnitude.

All  networks  support  this  algorithm  to  some  degree.  For  simplicity, 
assume  that  all  d  spectral  observations  of  h(i,j)  are  transferred  as  one 
datum.  From  Chapter  V,  all  networks  except  the  Illiac  network  can 
partition  the  system  under  single  control  partitioning,  as  is  required 
for  the  algorithm.  The  interconnection  function  needed  is  f(x)  = 
(x  +  1)  modulo  M.  Assume  that  the  three  data  items  to  be  passed,  the 
current  h(i,j),  the  class  of  h(i,j),  and  the  associated  class 
probability,  are  passed  as  a  block  of  data  whenever  possible. 

The recirculating Cube accomplishes f(x) using the following
algorithm. Assume that all PEs in a partition have the same n-r most
significant address bits.

for j = 0 step +1 until r-1
    MASK [X^(n-j) 0^j]
        Cube_j;

So, the recirculating Cube network requires r passes to implement f(x).

A multistage Cube network implements f(x) in one pass.
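
The r-pass realization of f(x) = (x + 1) modulo M on the recirculating Cube can be checked by simulation. In the sketch below the mask, which activates the PEs whose low j address bits are all 0 at pass j, is an assumption chosen so that the ascending loop propagates the carry; the function name and the list model of the PEs are likewise hypothetical.

```python
def cube_increment(r):
    """Move data from PE(x) to PE((x + 1) mod 2**r) using r masked Cube_j passes."""
    M = 1 << r
    holds = list(range(M))       # holds[p] = original address of the data now in PE(p)
    for j in range(r):
        nxt = list(holds)
        for p in range(M):
            # Assumed mask: PE(p) is active when its low j address bits are all 0.
            if p % (1 << j) == 0:
                nxt[p ^ (1 << j)] = holds[p]     # Cube_j complements address bit j
        holds = nxt
    return holds


after = cube_increment(3)
```

After the r passes, PE(p) holds the data that started in PE(p - 1 modulo M), so each datum has moved forward by one position within the partition.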

The PM2I network can be partitioned such that all PEs in a
partition have the same n-r least significant address bits. For
M = 2^r, f(x) is realized by the interconnection function PM2_{+(n-r)}.
The recirculating and multistage PM2I networks implement f(x) in one
pass through the network. The pipelined multistage Cube and PM2I
networks transfer the three data items as a block.
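
The claim that a single PM2I function realizes f(x) inside every partition can be verified exhaustively for small sizes. The following sketch assumes the usual definition PM2_{+i}(x) = (x + 2^i) modulo N and assumes that the logical address within a partition is formed by the r most significant address bits; the function names are illustrative.

```python
def pm2_plus(i, x, N):
    """PM2_{+i}(x) = (x + 2**i) modulo N."""
    return (x + (1 << i)) % N


def check_partition_shift(n, r):
    """True if PM2_{+(n-r)} maps logical PE k to logical PE (k+1) mod M in every partition."""
    N, M = 1 << n, 1 << r
    low_bits = n - r
    for tag in range(1 << low_bits):             # one partition per low-bit pattern
        for k in range(M):                       # logical address inside the partition
            physical = (k << low_bits) | tag
            expected = (((k + 1) % M) << low_bits) | tag
            if pm2_plus(low_bits, physical, N) != expected:
                return False
    return True


ok_32 = check_partition_shift(10, 5)   # N = 1024, M = 32
ok_4 = check_partition_shift(10, 2)    # N = 1024, M = 4
```

Because the low n-r bits are untouched by the addition of 2^(n-r), data never leave their partition.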

For M > 2, the Illiac network does not partition into complete
Illiac networks of size M, as was seen in section V.3. If all PEs in a
partition have consecutive PE addresses, then the Illiac_{+1}
interconnection function routes the data for all PEs except PE(M-1) in
one pass through the network. If 0 < M < sqrt(N), then M-1 uses of
Illiac_{-1} route the data from PE(M-1) to PE(0). The data transfer
time is, thus,

M * 256 * (dr + dm + drn + dr)
    + M * 256 * (M-1) * (2 dr + dm + drn)
= M * 256 * (2 dr + dm + drn) (1 + M - 1)
= M^2 * 256 * (2 dr + dm + drn).

If M >= sqrt(N), then data can be transferred from PE(M-1) to PE(0) in
1 + M/sqrt(N) steps. This requires using one Illiac_{+1} and M/sqrt(N)
Illiac_{-sqrt(N)} interconnection functions, as follows. Recall M = 2^r
and M - 1 = 2^r - 1. Then,

(2^r - 1) + 1 - sqrt(N) * 2^r/sqrt(N) = 0.

This yields a data transfer time of

M * 16^2 (Trc + (1 + M/sqrt(N)) * Trc)
= M * 16^2 [(M/sqrt(N) + 2) Trc]
= M * 256 [(M/sqrt(N) + 2) (2 dr + dm + drn)].

Clearly, the Illiac network is not as well suited to this algorithm as
the other networks discussed.

Table  VII. 8  lists  the  data  transfer  times  for  these  interconnection 
networks  for  the  image  classification  algorithm.  The  discussion  has 
shown  that  the  data  transfer  time  for  this  algorithm  is  not  negligible. 
The  best  choice  of  interconnection  network  is  the  recirculating  PM2I. 




Table VII.8: Delay times for data transfer for the image classification
algorithm, where M is the number of classes and D = dms = dm = dr = drn.


NETWORK                    DELAY

recirculating Cube         3 M 256 log2 M (2 dr + dm + drn)
                           = 3072 D M log2 M

multistage Cube            3 M 256 (2 dr + 10 dms) = 9216 D M

5-stage pipelined
multistage Cube            M 256 (dr + (5+3-1) (2 dms + dr))
                           = 5632 D M

recirculating PM2I         3 M 256 (2 dr + dm + drn) = 3072 D M

combinational logic
multistage PM2I            3 M 256 (2 dr + 10 dms) = 9216 D M

5-stage pipelined
multistage PM2I            M 256 (dr + (5+3-1) (2 dms + dr))
                           = 5632 D M

Illiac                     M^2 256 (2 dr + dm + drn)
                           = 1024 D M^2, 0 < M < sqrt(N).
                           M 256 (M/sqrt(N) + 2) (2 dr + dm + drn)
                           = 1024 D M (M/32 + 2), M >= sqrt(N).
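
The delay expressions for the image classification algorithm can be evaluated for a given number of classes with a short script. This is a sketch under assumptions: the function name and dictionary keys are illustrative, M is assumed to be a power of two, all elementary delays are set equal to D, and the case M = sqrt(N) is assigned to the second Illiac formula.

```python
import math


def classification_delays(M, N=1024, D=1):
    """Evaluate the data transfer delay of each network in units of D."""
    root = math.isqrt(N)
    per_pass = 4 * D                                  # 2 dr + dm + drn
    pipelined = M * 256 * (D + (5 + 3 - 1) * 3 * D)   # dr + (5+3-1)(2 dms + dr)
    delays = {
        "recirculating Cube": 3 * M * 256 * int(math.log2(M)) * per_pass,
        "multistage Cube": 3 * M * 256 * 12 * D,      # 2 dr + 10 dms
        "pipelined multistage Cube": pipelined,
        "recirculating PM2I": 3 * M * 256 * per_pass,
        "multistage PM2I": 3 * M * 256 * 12 * D,
        "pipelined multistage PM2I": pipelined,
    }
    if M < root:
        delays["Illiac"] = M * M * 256 * per_pass
    else:
        delays["Illiac"] = M * 256 * (M // root + 2) * per_pass
    return delays


small = classification_delays(4)
large = classification_delays(64)
```

For both M = 4 and M = 64 the recirculating PM2I entry is the smallest, and the Illiac entry grows quadratically in M.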




The worst choice is the Illiac network with its O(M^2) contribution to
the computation time of the algorithm. For Algorithm VII.1, as the
number of pixels stored in a PE increased, the block size of the data
transfer increased, and so, the pipelined networks transferred data
faster than the unpipelined networks. For Algorithm VII.4, since data
are not transferred in blocks, the phenomenon will not occur.

A classification algorithm can be designed which does not pass data
among PEs. For such an algorithm, each PE calculates all M
probabilities for all 16^2 pixels which it stores in its memory. This
method uses more memory than does Algorithm VII.4. Suppose that the data
pixels are composed of four spectral observations. Then, the mean
vector M_k has length four, and the covariance matrix is a 4 X 4 matrix.
If the covariance matrix is symmetric, as is often the case, then only
10 of the 16 entries need be stored. So, for each class, a PE stores 10
words (for the covariance matrix) + 4 words (for the mean vector) + 1
word (for b_k) = 15 words. If M = 32, then each PE stores 480 more words
for the algorithm which does not pass data than it does for Algorithm
VII.4.




VII.6. Conclusions

In designing a parallel computer system, the selection of the
interconnection network is dependent upon the types of algorithms the
system will execute. This chapter has formulated three parallel picture
processing tasks: a smoothing algorithm, a histogram formation
algorithm, and a classification algorithm. For each, the impact of
different interconnection networks on the efficiency of the algorithm
has been explored.

For  the  parallel  smoothing  algorithm,  data  movement  was  in  a 
neighborhood  of  eight  PEs  about  any  PE(i).  Thus,  the  Illiac  network  and 
the  recirculating  PM2I  network  exhibited  the  fastest  data  transfer 
times. The recirculating Cube network had the slowest time, since a
bit-serial addition or subtraction algorithm was computed in order to
control each data transfer.

Due  to  the  recursive  doubling  algorithm,  the  data  transfers  for  the 
parallel  histogram  formation  algorithm  are  more  global  in  nature.  The 
data  transfers  are  not  always  among  PEs  that  are  neighbors.  So,  the 
Illiac  network  requires  the  most  time  to  transfer  the  data.  The  other 
recirculating  networks  can  realize  the  interconnections  in  one  pass 
through  the  network.  Since  the  algorithm  transfers  data  in  blocks,  the 
pipelined  multistage  networks  pass  the  data  in  the  least  amount  of  time, 
and  the  combinational  logic  multistage  networks  are  seen  to  be  a  poor 
choice  for  this  algorithm. 

The parallel data classification algorithm illustrates the use of
single control partitioning. The 2^n PEs are partitioned into groups of
M PEs each, where M is the number of data classes. Because the Illiac
network does not partition into smaller, independent, complete networks
for N > 2, this network performs poorly for the classification
algorithm. The interconnection function needed for the algorithm is
f(x) = (x+1) modulo M. For M a power of two, this function is one of
the PM2I interconnection functions. Because of this, the recirculating
PM2I network needs the least time to transfer data for this algorithm.

The  analyses  of  Chapter  IV  were  applied  to  three  parallel  image 
processing  algorithms.  The  approach  developed  in  this  chapter  can  be 
generalized  and  applied  to  other  algorithms.  A  given  algorithm  was 
analyzed  for  the  data  transfer  needs,  and  the  effects  of  specified 
interconnection  networks  on  the  performance  of  the  algorithm  were  shown. 




VIII. AN EMULATOR NETWORK FOR
SIMD MACHINE INTERCONNECTION NETWORKS

VIII.1. Introduction

This chapter presents a single stage interconnection network, the
emulator network, for SIMD computers. The emulator network can be used
to simulate many types of interconnection networks in order to make
informed choices for machine designs. In section 2, a hardware design
of the emulator network is presented. A mechanism for defining control
signals and transmitting data is discussed. Section 3 uses the design
of section 2 to simulate some multistage interconnection networks
discussed in the literature. These include Pease's indirect binary
n-cube [PEA77], the STARAN network [BA76], the omega network [LAW75],
and the data manipulator [FE74]. Section 4 considers the simulation of
four recirculating networks, the Cube [SIE77a], the Shuffle-Exchange
[ST71], the PM2I [SIE77a], and the Illiac [BOU72]. Section 5 shows how
some useful permutations can be formed, and section 6 considers
Benes-type networks. Section 7 discusses how the emulator network can
operate in a machine that is partitioned into smaller SIMD machines,
each of size a power of two. Section 8 shows how the emulator network
can be used to simulate the interconnections in two multiple
instruction stream -


VIII.2. The Emulator Network

The emulator network is a recirculating interconnection network
that consists of the 2n-1 distinct PM2I functions and the Shuffle
function. Figure VIII.1 shows a design for the emulator network as seen
from PE(i). The emulator network can be built from N(2n+1) tristate
buffers for each bit of network width. Each interconnection function
corresponds to one of the 2n transmission buffers, and one buffer
receives signals at the output. If the maximum delay of a buffer (such
as the SN74240) is 7 ns, then the delay of the data transfer would be 14
ns plus the delay of the transmission lines between the transmitting and
receiving buffers.

The emulator system can be visualized in two ways. Physically, it
is a set of N PEs and the emulator network. Logically, it can be viewed
as a set of N nodes, where node i consists of PE(i), input position i of
the network (i.e., the 2n transmission buffers), and output position i
of the network (i.e., the receiving buffer).

Each  PE  supplies  the  control  bits  for  its  part  of  the  emulator 
network  (i.e.,  PE(i)  for  node  i)  by  means  of  a  2n  bit  Routing  Control 
Register  (RCR) .  For  each  pass  through  the  network,  each  PE  calculates 
where  to  send  its  data,  then  places  the  appropriate  control  bits  in  RCR. 
These  bits  go  to  the  controls  of  the  buffers  at  the  input  of  the 
emulator  network,  each  bit  of  the  RCR  corresponding  to  a  different 
interconnection  function.  Figure  VIII. 2  shows  the  configuration  of  the 
emulator  system. 

A routing instruction is provided so that each PE can load its RCR
with the appropriate controls for the given data transfer. For
simplicity, assume the instruction has 2n one-bit fields, one for each
PM2I function and one for the Shuffle function. If a field is 0, the
function is disabled. If a field is 1, the corresponding function is
enabled. A TRANSFER instruction initiates the data transfer which uses
the controls in RCR.

Figure VIII.1: A node of the emulator network for PE(i) for N = 2^n PEs
can be built using N (2 log2 N + 1) tristate buffers. Control signals
are provided by a register in PE(i).

Figure VIII.2: Architecture of the Emulator System for PE(i).

From these instructions, useful macro instructions may be
constructed. For example, a Cube_0 function can be implemented by
causing PEs whose least significant address bit is 0 to execute PM2_{+0}
while PEs whose least significant bit is 1 execute PM2_{-0}. This
process can be generalized to attain any Cube_i function. For notational
purposes, define Cube_i as the following macro, where
p_{n-1}...p_1 p_0 is a PE address.

Define Cube_i;
    RCR := 0;
    RCR := p_i * (PM2_{-i}) + (NOT p_i) * (PM2_{+i});
    TRANSFER;

where + is a logical OR operation and * is a logical AND. This means
that PEs where p_i = 0 execute PM2_{+i}, and PEs where p_i = 1 execute
PM2_{-i}.

Lemma VIII.1: The macro Cube_i above causes a Cube_i function to be
executed.

Proof:

Case 1: PE(P), where p_i = 0. Then, PM2_{+i} routes the data to the PE
whose address is P + 2^i modulo N = Cube_i(P).

Case 2: PE(P), where p_i = 1. Then, PM2_{-i} routes the data to the PE
whose address is P - 2^i modulo N = Cube_i(P).
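
Lemma VIII.1 can also be confirmed by exhaustive check for a small machine. The sketch below assumes the standard definitions Cube_i(P) = P with bit i complemented and PM2_{+-i}(P) = (P +- 2^i) modulo N; the function names are hypothetical.

```python
def cube(i, P):
    """Cube_i(P): complement bit i of the address P."""
    return P ^ (1 << i)


def cube_macro(i, P, N):
    """Destination of the data of PE(P) under the Cube_i macro built from PM2I functions."""
    if (P >> i) & 1 == 0:
        return (P + (1 << i)) % N     # p_i = 0: PM2_{+i}
    return (P - (1 << i)) % N         # p_i = 1: PM2_{-i}


n = 6
N = 1 << n
lemma_holds = all(cube_macro(i, P, N) == cube(i, P)
                  for i in range(n) for P in range(N))
```

Adding 2^i when bit i is 0 sets that bit without a carry, and subtracting 2^i when bit i is 1 clears it without a borrow, which is why the two cases of the proof cover every PE.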



For simplicity in the algorithms below, the following assumptions
are made:

(1) data to be transferred from PE(i) to PE(j) start in the DTRin of
PE(i) and must end in the DTRin of PE(j), 0 <= i, j < N; and

(2) the network always takes the data in DTRin as its input.




VIII.3. Simulation Results: Multistage Networks

The indirect binary n-cube network [PEA77] (Figure III.5) is based
on the Cube interconnection functions. In simulating this network, the
PEs of the emulator network can calculate controls at each stage and
apply one of the Cube_i macros defined above.

Theorem VIII.1: The emulator network can simulate the n-stage indirect
binary n-cube network in n passes.

Proof: By definition, stage i, 0 <= i < n, of the indirect binary n-cube
network executes Cube_i with individual box control and two function
interchange boxes. In [PEA77], it is proposed that a microprogrammable
microcomputer called a switch controller generate the settings for the
interchange boxes. Each switch controller provides controls for a "set"
of interchange boxes. Each control is either "exchange" or "straight."
The program which the switch controller uses to generate the controls
could be run in each PE of the emulator system in order to simulate the
effects of the switch controller. Since each PE need compute only the
control, c, for node i, the program run is a subtask of the program that
the switch controller would use.

An algorithm that uses the emulator network to simulate an indirect
binary n-cube is

activate all PEs,
RCR := 0;
for i = 0 step +1 until n-1 do
    calculate interchange box control c;
    if c = "Exchange" then
        Cube_i;
        DTRin := DTRout

Cube_i is the macro instruction defined in section VIII.2. Since stage i
of the indirect binary n-cube executes Cube_i, 0 <= i < n, the results of
the algorithm are the same as the actions of the indirect binary n-cube.
The actions of the emulator network nodes j and j+2^i simulate the action
of the interchange box at stage i whose inputs are j and j+2^i,
0 <= j < N, and the i-th bit of j is 0. Because data in the n-cube travel
through stage 0, then stage 1, etc., the loop of the algorithm goes from
i = 0, to i = 1, etc.

Banyan networks are a class of interconnection networks which are
defined in terms of graph representations. The graph of a banyan
network is a Hasse diagram of a partial ordering in which there is one
and only one path from any input vertex to any output vertex [GL73]. F,
the fanout, is the number of arcs incident into each vertex at a stage
from the last stage, and S, the spread, is the number of arcs incident
out from each vertex to the next stage. For S = F = 2, each vertex of
the graph represents a 2 X 2 crossbar switch, i.e., an interchange box.
One type of banyan structure is the SW banyan, shown in Figure VIII.3
with its corresponding graph for N = 8.

Corollary VIII.1: The emulator network can simulate an n-stage SW banyan
network for S = F = 2 in n passes.

Proof: From [GL73], the recursive definition of a banyan is as follows:

(1) A one level SW banyan structure with fanout F and spread S is a
    crossbar with F bases and S apexes.

(2) An L-level SW banyan structure with fanout F and spread S can be
    synthesized by interconnecting S^(L-1) crossbars and F identical
    (L-1) level SW banyan structures, all with fanout F and spread S.
    The apexes of the SW structures are connected to the bases of the
    crossbars such that the interconnection graph is a crossbar. Also,
    each crossbar must be connected to every component SW structure in
    the same way, i.e., if it is connected to the i-th apex of one SW,
    it must be connected to the i-th apex of each of the others.

So, an SW banyan with S = F = 2 has the same topology as the n-stage
indirect binary n-cube network. From Theorem VIII.1, the emulator
network can simulate an n-stage SW banyan network for S = F = 2 in n
passes through the network.

The STARAN flip network (Figure III.4) is also a multistage cube-type
network. The emulator system simulates the STARAN flip network with
either flip or shift controls with the aid of the Cube_i macros and
the computation facilities of the PEs.

Corollary VIII.2: The emulator network can simulate the n-stage STARAN
flip network with flip or shift controls in n passes.

Proof: The STARAN flip network [BA76] has the same topology as the
indirect binary n-cube with two function interchange boxes, but has a
different control structure [SIS78]. The flip control for the STARAN
flip network is individual stage control. For the flip control, the
STARAN uses the control vector F = (f_{n-1} ... f_1 f_0), where f_i = 1
causes stage i to execute Cube_i. An algorithm using the emulator
network to simulate the STARAN network under flip control is

    activate all PEs,
    RCR := 0;
    CU (control unit) broadcasts F to all PEs;
    for i = 0 step +1 until n-1 do
        if f_i then Cube_i;
        DTRin := DTRout.
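The flip-control algorithm above can be sketched in Python as follows. This is only an illustrative model, not the emulator hardware: a list indexed by PE address stands in for the data transfer registers, and the helper names `cube` and `staran_flip` are hypothetical.

```python
def cube(i, data):
    """Cube_i: the datum in PE p moves to the PE whose address differs in bit i."""
    return [data[p ^ (1 << i)] for p in range(len(data))]


def staran_flip(data, f):
    """Simulate flip control; f[i] is the control bit f_i for stage i."""
    for i, f_i in enumerate(f):  # stage 0 first, as in the algorithm above
        if f_i:
            data = cube(i, data)
    return data
```

In this model the datum that starts in PE p ends in the PE whose address is p with every bit i complemented for which f_i = 1.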

The shift controls for STARAN provide i+1 control signals at stage
i in order to perform shifts of 2^m modulo 2^p, 0 ≤ m < p ≤ n. Figure
III.4b shows the shift controls for the STARAN network for N = 8. Each
control signal is one bit, where 0 indicates straight and 1 indicates
Cube_i at stage i. Recall the definition of the shift control signals
from Chapter V. At stage i, let the i+1 control signals be labeled
c_{i0}, c_{i1}, ..., c_{ii}. Stage 0 has one control, c_{00}, that
affects all PEs. The signals control groups of PEs in the following
manner at stage i, i > 0, where each pattern is matched against the PE
address and X denotes a "don't care" bit.

    c_{i0}:  X^{n-i} 0^i;
    c_{i1}:  X^{n-i} 0^{i-1} 1;
    c_{i2}:  X^{n-i} 0^{i-2} 1 X;
      ...
    c_{ii}:  X^{n-i} 1 X^{i-1}.

Assume that the shift controls can be broadcast to the PEs as a
vector, c_i, of length n that is padded with zeroes, as needed. For
example, for N = 64, c_3 = ( 0 0 c_{33} c_{32} c_{31} c_{30} ). Let each
PE precompute and store an n bit mask for stage i, MASK_i, which has "1"
in position j if the PE is affected by c_{ij}, 0 ≤ i < n. For example,
let N = 64, and n = 6. For stage 0, all PEs are affected by c_{00}. So,
MASK_0 = (000001). For stage 1, all even PEs are affected by c_{10}.
For these PEs, MASK_1 = (000001). All odd PEs are affected by
c_{11}, and so for these PEs, MASK_1 = (000010). If the data words
of the system are 16 bits long, then for N ≤ 2^16 = 65,536, only n words
of storage are needed. To compute the control at stage i for a PE,
MASK_i is ANDed with c_i. If the result is not zero (all 0's), then
Cube_i is executed by the PE. An algorithm that uses the emulator
network to simulate the STARAN network with shift controls is

    activate all PEs,
    RCR := 0;
    for i = 0 step +1 until n-1 do
        CU broadcasts stage i controls, c_i, to all PEs;
        CNTRL := c_i * MASK_i;
        if NOT( CNTRL = 0 ) then Cube_i;
        DTRin := DTRout.
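The mask precomputation described above can be sketched as follows. The sketch assumes that c_{ij}, j > 0, affects exactly those PEs whose highest 1 among the i low-order address bits is in position j-1, and that c_{i0} affects the PEs whose i low-order bits are all 0; this agrees with the stage 0 and stage 1 examples in the text. The function names are hypothetical.

```python
def stage_mask(pe, i):
    """Return MASK_i for PE `pe`; bit j is 1 iff the PE is affected by c_ij."""
    low = pe & ((1 << i) - 1)   # the i low-order address bits
    j = low.bit_length()        # 0 if they are all zero, else 1 + position of highest 1
    return 1 << j


def executes_cube(pe, i, c_i):
    """True if the PE executes Cube_i for the broadcast control vector c_i."""
    return (c_i & stage_mask(pe, i)) != 0
```

Here the control vector c_i is held as an integer whose bit j is c_{ij}, so the AND of the algorithm becomes a single bitwise operation.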

The shifts of 2^m modulo 2^p could be more efficiently implemented using
the PM2I functions of the emulator network. For example, the shifts 2^m
modulo 2^n are the PM2I functions PM2_{+m}. However, the algorithm above
more closely simulates the actions of the STARAN flip network. As in
the proof of Theorem VIII.1, the actions of emulator nodes j and j+2^i
simulate the action of the interchange box at stage i whose inputs are j
and j+2^i, 0 ≤ j < N, and the i-th bit of j is 0. Because data in the
flip network pass first through stage 0, then stage 1, etc., the loop
index i in the algorithm goes from 0 to n-1. □




The generalized cube network [SIS78] (Figure III.6) is defined as a
standard topology for multistage Cube networks, and may be used with
either type of interchange box and any of the three multistage control
structures. The PM2I interconnection functions of the emulator network
can be used to simulate the broadcast states of four function
interchange boxes. For example, an upper broadcast from PE(0) to PE(1)
is accomplished by saving the data in PE(0) and also routing it to
PE(PM2_{+0}(0)) = PE(1).

Theorem VIII.2: The emulator network can simulate the n-stage
generalized cube network in n passes.

Proof: Assume the most flexible construction and control, that of four
function interchange boxes and individual box control. In [SIS78], no
particular method of generating control signals was discussed. Assume
that each PE in the emulator system can generate a pair of control bits,
call them c, for each pass through the emulator network (one stage of
the generalized cube). An algorithm which uses the emulator network to
simulate a generalized cube is as follows, where p_i is the i-th bit of
the PE address.

    activate all PEs,
    RCR := 0;
    for i = n-1 step -1 until 0 do
        if p_i = 0 and c = ("exchange" or "upper broadcast")
            then RCR := PM2_{+i};
        if p_i = 1 and c = ("exchange" or "lower broadcast")
            then RCR := PM2_{-i};
        TRANSFER;
        DTRin := DTRout.

After n steps, the algorithm routes the data in the same way as the
generalized cube. As in the proof of Theorem VIII.1, emulator nodes j
and j+2^i act as an interchange box. Since the data pass first through
stage n-1, then stage n-2, etc., the loop index i goes from n-1 to 0. □
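A minimal Python sketch of this simulation is given below. It is a model under stated assumptions, not the emulator itself: the control of each box is looked up in a per-stage dictionary keyed by the address of the box's upper input (the PE with p_i = 0), boxes not listed default to straight, and the names are hypothetical.

```python
def generalized_cube(data, controls, n):
    """Simulate n passes; controls[i][j] is the state of the stage i box
    whose upper input is PE j (bit i of j is 0)."""
    for i in range(n - 1, -1, -1):           # stage n-1 first
        nxt = list(data)                     # straight: every PE keeps its datum
        for j in range(len(data)):
            if j & (1 << i):
                continue                     # j is a lower input; handled with its partner
            k = j | (1 << i)                 # partner PE, address j + 2^i
            state = controls[i].get(j, "straight")
            if state == "exchange":
                nxt[j], nxt[k] = data[k], data[j]
            elif state == "upper broadcast":
                nxt[k] = data[j]             # PM2_{+i} from the upper PE; it keeps a copy
            elif state == "lower broadcast":
                nxt[j] = data[k]             # PM2_{-i} from the lower PE; it keeps a copy
        data = nxt
    return data
```

For example, with N = 8 and only the stage 0 box with upper input 0 set to "upper broadcast", the datum of PE(0) is copied into PE(1) and all other data are unchanged.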

Each stage of an n-stage Shuffle-Exchange network is a shuffle
interconnection possibly followed by an exchange (Figure III.3),
depending on the control settings.

Corollary VIII.3: The emulator network can simulate the n-stage
Shuffle-Exchange network in n passes.

Proof: The n-stage Shuffle-Exchange network is topologically equivalent
to the generalized cube [SIS78]. Using Theorem VIII.2, this network can
be simulated in n steps. □

Lang and Stone [LAST76] have presented a simplified control scheme
for a Shuffle-Exchange network with two function interchange boxes. The
control bits for the interchange boxes are calculated from those of the
previous stage. For N = 8, Figure VIII.4 shows an expanded Shuffle-
Exchange network with this simplified control. Let the first stage of
the network be labeled stage n-1. For i even, C^k(i) is the control for
the interchange box for data lines i and i+1 at stage n-k, where
C^k(i) = 1 means exchange, and C^k(i) = 0 means straight, 1 ≤ k ≤ n.
Note that C^k(i) = C^k(i+1). For i even and k ≥ 2, the control at stage
n-k is

    C^k(i) = C^{k-1}(Shuffle^{-1}(i)) o C^{k-1}(Shuffle^{-1}(i+1)),

where o is a particular boolean operation. The initial condition,
C^1(i), is

    C^1(i) = D_{n-1}(Shuffle^{-1}(i)),

for i even, where D_{n-1}(j) is the most significant bit of the
destination of data from PE(j). As an example, let N = 8 and let the
boolean operation be the exclusive or function. Figure VIII.5 realizes a
cyclic shift of 3, that is, PE(x) sends data to PE(x + 3 modulo 8). The
initial conditions are

    C^1(0) = 0,
    C^1(1) = 0,
    C^1(2) = 1,
    C^1(3) = 1,
    C^1(4) = 1,
    C^1(5) = 1,
    C^1(6) = 1,
    C^1(7) = 1.
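The initial conditions listed above can be reproduced with a short Python sketch. It assumes that Shuffle^{-1} rotates the n-bit line number right by one position and that each odd line shares the control of the even line of its interchange box; the function names are hypothetical.

```python
def inverse_shuffle(i, n):
    """Shuffle^{-1}: rotate the n-bit number i right by one position."""
    return (i >> 1) | ((i & 1) << (n - 1))


def initial_controls(dest, n):
    """Return [C^1(0), ..., C^1(N-1)] for the destination function `dest`."""
    N = 1 << n
    c = [0] * N
    for i in range(0, N, 2):                      # i even
        source = inverse_shuffle(i, n)
        msb = (dest(source) >> (n - 1)) & 1       # D_{n-1}(Shuffle^{-1}(i))
        c[i] = c[i + 1] = msb                     # C^1(i) = C^1(i+1)
    return c
```

Calling `initial_controls(lambda x: (x + 3) % 8, 3)` models the cyclic shift of 3 for N = 8.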

Lang has shown that if the boolean function for the simplified
control is the exclusive or function, then all uniform shifts of PE(x)
to PE(x + j modulo N) can be accomplished by the network. If the
equivalence function ( f(x,y) = 1 if and only if x = y ) is used, then
data can be passed from PE(x) to PE((t - x) modulo N), 0 ≤ x < N, t an
integer.


In [LAST76], it is proposed that the boolean function is hardwired
into the control structure of the network. The emulator control is not
bound by the hardware in this way, and the PEs can choose any boolean
function to be implemented. Thus, the effects of different operators
may be simulated.

Figure VIII.4: A Shuffle-Exchange network with simplified control for
N = 8. [...] lines represent paths of control signals. The control bits
are calculated from those of the previous stage. [LAST76]


Corollary VIII.4: The emulator network can simulate the n-stage
Shuffle-Exchange network with simplified controls in n passes.

Proof: Assume that the control bits at each stage are passed with the
data they affected, as in [LAST76]. At any stage, for each datum, call
this bit C. After a Shuffle transfer at stage k, C is either
C^{k-1}(Shuffle^{-1}(i)) or C^{k-1}(Shuffle^{-1}(i+1)), as mentioned
above (Figure VIII.4), for i even, k ≥ 2. Assume that a destination
D_j = d_{n-1}...d_0 is initially specified for data from PE(j). Since
the multistage Shuffle-Exchange network is topologically equivalent to
the generalized cube network, the algorithm below is based on Theorem
VIII.2. This could be done in n passes, as follows. Let each PE keep a
copy of the data as well as send it to the PE that differs in the i-th
bit position. If the control bit is 0, the retained data is moved to
DTRin. For clarity, the following algorithm simulates the
Shuffle-Exchange network with simplified control in 2n passes.

    activate all PEs,
    RCR := 0;
    DATA := DTRin;
    *** Initialize controls ***
    C := d_{n-1};
    if p_{n-1} = 0 then DTRin := C;
    RCR := PM2_{+(n-1)};
    TRANSFER;
    if p_{n-1} = 1 then C := DTRout;
    DTRin := DATA;
    if C then Cube_{n-1};
    DATA := DTRout;
    for k = 2 step +1 until n do
        *** Pass controls for adjacent PEs. ***
        DTRin := C;
        C1 := DTRout;
        *** Calculate new control. ***
        C := C o C1;
        DTRin := DATA;
        if C then Cube_{n-k};
        DATA := DTRout;
    DTRin := DATA.

If the data item and the C bit require more bits than the network
width, multiples of n passes will be needed. □

The omega network [LAW75] is an n-stage Shuffle-Exchange network.
One method of controlling the omega network is by associating with each
datum the address of the destination PE. This address is called a
destination tag. The omega network considers only two function
interchange boxes for this type of control. Let a destination tag
D_i = d_{n-1}...d_0 be passed with the datum originally from PE(i),
0 ≤ i < N. At stage j, if d_j = 0 for the upper input of the interchange
box, the interchange box is set to straight. If d_j = 1 for the upper
input, the interchange box is set to exchange. The j-th bit of the
destination tags of the lower and upper inputs must be complements, or
else an error occurs.

Theorem VIII.3: Let the data word to be transferred be W bits wide, and
let a destination tag associated with the data word be n bits wide.
Consider an n stage omega network with a width of W+n bits. The
emulator network can simulate an n stage omega network with destination
tags for control in n passes.

Proof: The omega network is topologically equivalent to the generalized
cube [SIS78], and an algorithm based on Theorem VIII.2 simulates the
omega network in n passes.

    activate all PEs,
    RCR := 0;
    for i = n-1 step -1 until 0 do
        if p_i ≠ d_i then Cube_i;
        DTRin := DTRout.

If the width of the network is less than the length of D_i plus the
length of the data, multiples of n passes through the network are
needed. □
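Destination tag routing on the emulator can be modelled by the sketch below. It assumes that a datum is moved across Cube_i exactly when bit i of its current PE address disagrees with bit i of its destination tag, and it reports a conflict when two data would occupy the same PE, which corresponds to the error condition described for the omega network. The function name is hypothetical.

```python
def route_by_destination_tags(dest, n):
    """dest[s] is the destination tag of the datum that starts in PE s.
    Returns where[s], the PE that holds that datum after n passes."""
    N = 1 << n
    where = list(range(N))
    for i in range(n - 1, -1, -1):
        bit = 1 << i
        # Move the datum across Cube_i when address bit i differs from tag bit i.
        where = [w ^ bit if (w ^ dest[s]) & bit else w
                 for s, w in enumerate(where)]
        if len(set(where)) != N:
            raise ValueError(f"two data collide at stage {i}")
    return where
```

A uniform shift such as dest[s] = (s + 3) modulo 8 passes without conflict in this model, while an arbitrary permutation may not.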

Theorem VIII.4: The emulator network can simulate the n-stage omega
network in n passes.

Proof: The omega network is an n-stage Shuffle-Exchange network with
four function interchange boxes [LAW75]. In [LAW75], no control scheme
for four function interchange boxes is discussed. Assume the control
convention used in Theorem VIII.2. Since the omega network is
topologically equivalent to a generalized cube network with four
function interchange boxes and individual cell control [SIS78], the
algorithm to simulate the omega network is the same as the one in
Theorem VIII.2.

If j passes through the omega network are required to realize a
given data transfer, then j*n passes through the emulator network
realize the same transfer. □

The data manipulator [FE74] is an n-stage PM2I network where stage
i performs PM2_{+i}, PM2_{-i}, or straight. For the control of the
network discussed in [FE74], each stage of the network is controlled by
a pair of signals, specifying two of H_1, H_2, U_1, U_2, D_1, and D_2
as defined in Chapter III and shown in Figure III.7.

Theorem VIII.5: The emulator network can simulate the n-stage data
manipulator network in n passes.

Proof: The control signals for the data manipulator are supplied for
each stage by external control. With the emulator network, at pass k,
PE(j) controls cell j of stage n-k. For each pass, call the two control
signals supplied to the stage C1 and C2, and let them be broadcast to
all PEs. To simulate the data manipulator network with the emulator:


    activate all PEs,
    RCR := 0;
    for i = n-1 step -1 until 0 do
        clear RCR;
        DTRout := DTRin;
        CU broadcasts C1 and C2;
        if p_i = 0 then
            if C1 = U then RCR := PM2_{-i};
            if C1 = D then RCR := RCR + PM2_{+i};
        else
            if C2 = U then RCR := PM2_{-i};
            if C2 = D then RCR := RCR + PM2_{+i};
        TRANSFER;
        DTRin := DTRout. □


The augmented data manipulator (ADM) network [SIS78] is based on
Feng's data manipulator [FE74]. It allows each of the N cells at each
stage to receive its own control signals, any combination of H, U, and
D.

Corollary VIII.5: The emulator network can simulate the n-stage
augmented data manipulator (ADM) network in n passes.

Proof: No particular method of generating the control signals for the
ADM is specified in [SIS78]. It is assumed that for each data item the
encoded control information for stage i, C(i), is passed with the data,
and can be any combination of one to three of H, U, and D.


    activate all PEs,
    RCR := 0;
    for i = n-1 step -1 until 0 do
        clear RCR;
        DTRout := DTRin;
        if C(i) = U then RCR := PM2_{-i};
        if C(i) = D then RCR := RCR + PM2_{+i};
        TRANSFER;
        DTRin := DTRout.

If the data item and the n C(i)'s require more bits than the width
of the network, multiples of n passes will be required. If j passes
through the ADM are required to realize a given data transfer, then j*n
passes through the emulator network can realize the same data transfer. □


A CC banyon network can be laid out as a crosshatch pattern on the
surface of a cylinder. CC is an acronym for Cylindrical Crosshatch
[GL73]. Figure VIII.6 shows a three stage CC banyon network for N = 8
and S = F = 2.


Theorem VIII.6: The emulator network can simulate an n-stage CC banyon
network for S = F = 2 in n passes.

Proof: A CC banyon network [GL73], with fanout F and spread S, is
defined as an L stage network with N = S^L vertices at each level. There
is an arc from vertex v_i^k at stage k to vertex v_j^{k+1} at stage k+1
if and only if j = (i + m*S^k) modulo N for some m, 0 ≤ m < S. For
S = 2, m ∈ {0, 1}. Thus, at stage k, since F = 2, each vertex v_i^k has
connections to v_i^{k+1} and to v_j^{k+1}, where j = (i + 2^k) modulo N.
Since the emulator network provides all PM2_{+i} functions, 0 ≤ i < n,
it can simulate the CC banyon network for S = F = 2. □

Figure VIII.6: The graph of a CC banyon network for S = F = 2 and N = 8
[GL73]. v_i^k at stage k connects to v_j^{k+1} at stage k+1 if and only
if j = i or j = (i + 2^k) modulo N, 0 ≤ k < n.
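The arcs of the CC banyon can be enumerated with the following sketch, which assumes the connection rule j = (i + m*S^k) modulo N with N = S^n. For S = 2 every arc is either the identity or PM2_{+k}, which is the observation the proof relies on. The function names are hypothetical.

```python
def cc_banyon_arcs(n, k, S=2):
    """Arcs from stage k to stage k+1: vertex i connects to (i + m*S**k) mod N."""
    N = S ** n
    return {i: [(i + m * S ** k) % N for m in range(S)] for i in range(N)}


def pm2_plus(p, i, N):
    """PM2_{+i}(p) = (p + 2**i) mod N."""
    return (p + (1 << i)) % N
```

For N = 8, `cc_banyon_arcs(3, 1)` maps each vertex i to the pair [i, (i + 2) modulo 8].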

Corollary VIII.6: The emulator network can simulate any CC banyon for
S = F = 2^s and N = S^L.

Proof: For S = 2^s, m ∈ {0, 1, ..., 2^s - 1}. The emulator provides
all PM2I functions, and so the connections from v_i^k to v_j^{k+1} for
j = (i + m*S^k) modulo N can be realized, although multiple passes
through the emulator network may be required for each level of the CC
banyon network. For example, if S = 2^2, then m ∈ {0, 1, 2, 3}. The
connection from v_i^k to v_j^{k+1} for j = (i + 3*2^2) modulo N = (i + 8
+ 4) modulo N can be realized in two passes through the emulator
network by using the interconnection functions PM2_{+3} and PM2_{+2}.

Chapter IV introduced the Shuffle - No Shuffle - Exchange (SNSE)
network. At each stage of this n-stage network, the options are no
change, Shuffle, Exchange, or Shuffle-Exchange. This network can
perform 1 to n-1 Shuffles in one pass through the network, as well as
perform all the data transfers of a multistage Shuffle-Exchange network.

Theorem VIII.7: The emulator network can simulate an n-stage SNSE
network in at most 2n passes.

Proof: An algorithm which simulates the SNSE network is


    activate all PEs,
    RCR := 0;
    for i = n-1 step -1 until 0 do
        CU broadcasts SNSE controls;
        if (Shuffle or Shuffle-Exchange) then RCR := Shuffle;
        TRANSFER;
        DTRin := DTRout;
        if (Exchange or Shuffle-Exchange) then Cube_0;
        DTRin := DTRout. □

The algorithm above requires at most 2n passes to transfer the
data. If the SNSE controls specify no shuffle and no exchange, then the
emulator network need not pass data. If the SNSE controls specify a
shuffle and an exchange at each stage, then the emulator network uses 2n
passes to accomplish the data transfer.
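The pass count discussed above can be illustrated with a small Python model. It assumes that the control for each stage is one of the strings "none", "shuffle", "exchange", or "shuffle-exchange" (hypothetical encodings), that the control is broadcast so all PEs act alike, and that a stage costs one emulator pass for a Shuffle plus one for an Exchange.

```python
def shuffle(data):
    """Perfect shuffle: the datum in PE p moves to PE rotate-left(p)."""
    N = len(data)
    n = N.bit_length() - 1
    out = [None] * N
    for p in range(N):
        q = ((p << 1) | (p >> (n - 1))) & (N - 1)
        out[q] = data[p]
    return out


def exchange(data):
    """Exchange = Cube_0: neighbouring even and odd PEs swap data."""
    return [data[p ^ 1] for p in range(len(data))]


def simulate_snse(data, controls):
    """Return (final data, number of emulator passes used)."""
    passes = 0
    for ctl in controls:                      # one entry per SNSE stage
        if ctl in ("shuffle", "shuffle-exchange"):
            data = shuffle(data)
            passes += 1
        if ctl in ("exchange", "shuffle-exchange"):
            data = exchange(data)
            passes += 1
    return data, passes
```

With n = 3, three "shuffle-exchange" stages use 2n = 6 passes and three "none" stages use no passes in this model.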


VIII.4. Simulation Results: Recirculating Networks

Chapter II defined a recirculating network as an interconnection
network that uses a single stage of switches to route data among PEs.
Chapter IV presented hardware designs and data transfer time comparisons
for several recirculating networks, in particular, the Shuffle-Exchange,
the Cube, the PM2I, and the Illiac. The emulator network can simulate
these recirculating networks.

Theorem VIII.8: The emulator network can simulate a recirculating
Shuffle-Exchange network.

Proof: A recirculating Shuffle-Exchange network allows each PE to
execute either a Shuffle interconnection function or an Exchange
interconnection function for each transfer [SIE77a]. The Shuffle is one
of the interconnection functions provided by the emulator network, and
the Exchange = Cube_0. □

Theorem VIII.9: The emulator network can simulate a recirculating Cube
network.

Proof: In a recirculating Cube network, all active PEs execute the
Cube_i interconnection function that is broadcast by the control unit
[SIE77a]. The emulator network can execute the macro instruction Cube_i
(Lemma VIII.1). □

Theorem VIII.10: The emulator network can simulate the recirculating
PM2I network.

Proof: In a recirculating PM2I network, all active PEs execute the
single PM2I interconnection function that is broadcast by the CU
[SIE77a]. The 2n-1 distinct PM2I functions are included in the emulator
network. □

Corollary VIII.7: The emulator network can simulate the Illiac network.

Proof: The four Illiac interconnection functions [BOU72] are a subset of
the PM2I functions, where Illiac_{+1}(P) = PM2_{+0}(P), Illiac_{-1}(P) =
PM2_{-0}(P), Illiac_{+√N}(P) = PM2_{+n/2}(P), and Illiac_{-√N}(P) =
PM2_{-n/2}(P). □
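The correspondence between the Illiac functions and the PM2I functions can be checked with a short sketch. It assumes N = 2^n is a perfect square (n even) and models each function as modular addition; the helper names are hypothetical.

```python
import math


def pm2(p, i, sign, N):
    """PM2_{+i} (sign = +1) or PM2_{-i} (sign = -1) applied to address p."""
    return (p + sign * (1 << i)) % N


def illiac_as_pm2i(n):
    """Map each Illiac offset to the (sign, i) of the equivalent PM2I function."""
    N = 1 << n
    root = math.isqrt(N)
    if root * root != N:
        raise ValueError("the Illiac network needs N to be a perfect square")
    return {+1: (+1, 0), -1: (-1, 0), +root: (+1, n // 2), -root: (-1, n // 2)}
```

For N = 16 the four offsets are +1, -1, +4, and -4, realized by PM2_{+0}, PM2_{-0}, PM2_{+2}, and PM2_{-2}.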


VIII.5. Other Useful Permutations

The emulator network can simulate the perfect shuffle of size 2^{n-k}
and the inverse perfect shuffle of size 2^{n-k}, for any 0 ≤ k < n.
Figure VIII.7 illustrates the perfect shuffle and its inverse for N = 8.

Theorem VIII.11: The emulator network can simulate an inverse perfect
shuffle of size 2^{n-k} in n-k-1 passes, 0 ≤ k < n.

Proof: For PE(P) = p_{n-1}...p_0, at step i, the simulation algorithm has
the effect of comparing bits p_i and p_{i+1} and, if different, sending
the data to PE p_{n-1}...p_{i+2} p_i p_{i+1} p_{i-1}...p_0.

    for i = 0 step +1 until n-k-2 do
        if p_i ≠ p_{i+1} then
            RCR := p_i * (PM2_{+i}) + (NOT p_i) * (PM2_{-i});
        TRANSFER;
        DTRin := DTRout.

The case p_i * (PM2_{+i}) sends data from PE p_{n-1}...01p_{i-1}...p_0 to
p_{n-1}...10p_{i-1}...p_0. The case (NOT p_i) * (PM2_{-i}) sends data
from PE p_{n-1}...10p_{i-1}...p_0 to p_{n-1}...01p_{i-1}...p_0. Thus,
after n-k-1 comparisons, the data originally in P = p_{n-1}...p_1p_0 is
transferred to PE p_{n-1}...p_{n-k} p_0 p_{n-k-1}...p_1. An inverse
perfect shuffle has been accomplished.

Theorem VIII.12: The emulator network can simulate a perfect shuffle of
size 2^{n-k} in n-k-1 passes, 0 ≤ k < n.

Proof: The simulation algorithm is similar to that of Theorem VIII.11.
Adjacent bits of the PE address are compared and, if different, the data
is routed as follows.

    RCR := 0;
    for i = n-k-1 step -1 until 1 do
        if p_{i-1} ≠ p_i then
            RCR := p_i * (PM2_{-(i-1)}) + (NOT p_i) * (PM2_{+(i-1)});
        TRANSFER;
        DTRin := DTRout.

The case p_i * (PM2_{-(i-1)}) sends data from PE p_{n-1}...10p_{i-2}...p_0
to p_{n-1}...01p_{i-2}...p_0. The case (NOT p_i) * (PM2_{+(i-1)}) sends
data from PE p_{n-1}...01p_{i-2}...p_0 to p_{n-1}...10p_{i-2}...p_0.
Successive steps of the algorithm send data from P = p_{n-1}...p_1p_0 to
PE p_{n-1}...p_{n-k} p_{n-k-2}...p_0 p_{n-k-1}. A perfect shuffle of size
2^{n-k} is accomplished. □
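The two algorithms of Theorems VIII.11 and VIII.12 can be modelled with one helper that interchanges two adjacent address bits per pass. The sketch tracks, for each source PE, the PE currently holding its datum; the names are hypothetical and the model ignores register-level details.

```python
def swap_pass(where, i):
    """One pass: a datum in a PE whose address bits i and i+1 differ moves to
    the PE with those two bits interchanged (PM2_{+i} or PM2_{-i})."""
    moved = []
    for w in where:
        lo, hi = (w >> i) & 1, (w >> (i + 1)) & 1
        if lo != hi:
            w += (1 << i) if lo else -(1 << i)   # ...01... -> ...10... or back
        moved.append(w)
    return moved


def inverse_perfect_shuffle(n, k):
    """Theorem VIII.11: steps i = 0, 1, ..., n-k-2."""
    where = list(range(1 << n))
    for i in range(0, n - k - 1):
        where = swap_pass(where, i)
    return where


def perfect_shuffle(n, k):
    """Theorem VIII.12: steps i = n-k-1, ..., 1, comparing bits i-1 and i."""
    where = list(range(1 << n))
    for i in range(n - k - 1, 0, -1):
        where = swap_pass(where, i - 1)
    return where
```

In both functions `where[P]` is the final location of the datum that started in PE(P), so the result can be compared directly with a rotation of the n-k low-order address bits.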

For the PM2I interconnection functions, moving data from PE(P) to
PE(Q) can be expressed as addition or subtraction to P of various values
of 2^i, 0 ≤ i < n. Let D = d_{n-1}...d_1d_0, d_i ∈ {0, 1, -1}, represent
these transitions. D is a base 2 signed digit representation of the
difference between P and Q. Each d_i ≠ 0 represents an interconnection
function. If d_i = 1, then PM2_{+i} is applied to the data at stage i.
If d_i = -1, then PM2_{-i} is applied to the data at stage i. If
d_i = 0, then loc_{n-i}(P) = loc_{n-i-1}(P).

Using the PM2I functions of the emulator network, there is more
than one path from P to Q, and so D is not a unique number. For example,
for N = 16, there are four different signed digit representations of the
path from 0 to 7. They are

    a)  0  1  1  1
    b)  1  0  0 -1
    c)  1  0 -1  1
    d)  1 -1  1  1.

The  number  of  non-zero  digits  in  each  of  these  representations  is  called 
the  multiplicity.  Form  b  has  the  lowest  multiplicity  (two)  and  is 
called  the  n-digit  minimal  form  of  natural  value  k,  for  n  =  4  and  k  =  7. 
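The four representations and their multiplicities can be found by exhaustive search. The following sketch enumerates all n-digit signed digit forms of a natural value; it is a brute-force illustration with hypothetical function names, not a practical recoding method.

```python
from itertools import product


def signed_digit_forms(value, n):
    """All n-digit forms d_{n-1}...d_0, d_i in {1, 0, -1}, with the given value."""
    forms = []
    for digits in product((1, 0, -1), repeat=n):   # digits[0] is d_{n-1}
        if sum(d * 2 ** (n - 1 - pos) for pos, d in enumerate(digits)) == value:
            forms.append(digits)
    return forms


def multiplicity(form):
    """Number of non-zero digits in a signed digit form."""
    return sum(1 for d in form if d != 0)
```

For example, `min(signed_digit_forms(7, 4), key=multiplicity)` selects the minimal form among the 4-digit representations of 7.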

Reitwiesner [REI60] has presented a detailed study of the existence
and generation of minimal forms. These results are summarized here.
Let 0x_{n-1}...x_1x_0 be an n+1 digit binary number of natural value k.
Then,

1) there exists a unique canonical form y_n y_{n-1}...y_0 in signed
digit notation such that

    y_i * y_{i-1} = 0, 0 < i ≤ n,

and,

2) no other form of this natural value k has lower multiplicity
than the canonical form.

A serial canonical recoding algorithm gives the following rules for
converting X = x_{n-1}...x_0 to Y = y_n...y_0, y_i ∈ {0, 1, -1}.
Beginning with i = 0, inspect x_{i+1}, x_i, and a carry digit c_i
(c_0 = 0). Use the results to generate y_i and a carry out c_{i+1} to be
used at the next step. The rules are

    for i = 0 step +1 until n

        c_{i+1} = 1 if (x_{i+1} + x_i + c_i) > 1,
                  0 otherwise;

        y_i = x_i + c_i - 2*c_{i+1}.

The symbol "+" refers to addition. For example, the algorithm has the
effect of replacing the string 011...110 in X by the string 100...0-10
in Y.
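A minimal Python sketch of serial canonical recoding follows. It assumes the usual form of the rules, with initial carry c_0 = 0, carry out c_{i+1} = 1 when x_{i+1} + x_i + c_i > 1, and output digit y_i = x_i + c_i - 2*c_{i+1}; the function name is hypothetical.

```python
def canonical_recode(x, n):
    """Recode the n-bit natural number x into canonical signed digits.
    Returns [y_n, y_{n-1}, ..., y_0]."""
    bits = [(x >> i) & 1 for i in range(n + 2)]   # x_n and x_{n+1} are 0
    y = []
    c = 0                                         # assumed initial carry c_0 = 0
    for i in range(n + 1):
        c_next = 1 if bits[i + 1] + bits[i] + c > 1 else 0
        y.append(bits[i] + c - 2 * c_next)        # y_i
        c = c_next
    return y[::-1]                                # most significant digit first
```

For n = 4 and x = 7 the sketch yields the digits 0 1 0 0 -1, i.e. form b above with a leading zero digit y_4.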

In order to move data from PE(P) to PE(Q), consider the distance D
between them such that (P + D) modulo N = Q. Let D be a positive binary
number d_{n-1}...d_0 that represents a path from P to Q utilizing only
the PM2_{+i} interconnection functions.

Lemma VIII.2: Given D, a path utilizing the minimum number of PM2I
interconnection functions from P to Q can be found by recoding D into a
signed digit vector in canonical form.

Proof: Apply the canonical recoding algorithm above to D to yield
y_n y_{n-1}...y_0. By [REI60], this result is unique and has the least
multiplicity of any form of the natural value represented by D. The
path from P to Q is represented by

    (P + y_n 2^n + y_{n-1} 2^{n-1} + ... + y_0 2^0) modulo N =
    (P + y_{n-1} 2^{n-1} + ... + y_0 2^0) modulo N = Q. □


Theorem VIII.13: The emulator network performs a uniform shift of
distance A, i.e., PE(P) sends data to PE(P+A modulo N) for all P, 0 ≤ P
< N, in K passes through the network, where K is the number of nonzero
digits in the canonical signed digit form of A.

Proof: The distance that all PEs transfer data is A, and from Lemma
VIII.2 A can be recoded into canonical signed digit form that has
the fewest number of nonzero digits of any binary representation of A.
Let the recoded A be y_{n-1}y_{n-2}...y_1y_0. The following algorithm
performs a uniform shift of A.

    activate all PEs,
    for i = n-1 step -1 until 0
        DTRout := DTRin;
        clear RCR;
        if y_i = 1 then RCR := PM2_{+i};
        if y_i = -1 then RCR := PM2_{-i};
        TRANSFER;
        DTRin := DTRout; □
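The uniform shift can be sketched as follows. The recoding helper here uses the non-adjacent form, computed by repeated division, as a stand-in for the canonical recoding described earlier; both give the unique form with no two adjacent nonzero digits. Only passes with y_i ≠ 0 are counted, and a digit y_n, if present, is skipped because 2^n modulo N = 0. All names are hypothetical.

```python
def naf(a):
    """Signed digits [y_0, y_1, ...] of a >= 0 with no two adjacent nonzero digits."""
    digits = []
    while a:
        if a & 1:
            d = 2 - (a % 4)     # +1 if a = 1 (mod 4), -1 if a = 3 (mod 4)
            a -= d
        else:
            d = 0
        digits.append(d)
        a >>= 1
    return digits


def uniform_shift(n, a):
    """Return (where, passes): where[P] is the PE reached by the datum from PE(P)."""
    N = 1 << n
    y = naf(a % N)
    where = list(range(N))
    passes = 0
    for i in range(n - 1, -1, -1):
        d = y[i] if i < len(y) else 0
        if d:                                   # PM2_{+i} or PM2_{-i}
            where = [(w + d * (1 << i)) % N for w in where]
            passes += 1
    return where, passes
```

For N = 16 a shift of 7 uses two passes in this model (PM2_{+3} and PM2_{-0}).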

Theorem VIII.14: The emulator network can perform a broadcast function
in ⌈n/2⌉ passes through the network.

Proof: The maximum number of interconnection functions that must be
applied to send data from PE(P) to any arbitrary PE(Q) can be found by
examining the distance between P and Q. By the canonical recoding
algorithm, any distance D between P and Q can be recoded into signed
digit form with minimal multiplicity, and thus the minimal number of
interconnection functions. This result, Y, has the property

    y_i * y_{i-1} = 0, 0 < i ≤ n.

For arbitrary Q, the maximum multiplicity of Y must occur for
Y = (01)^{n/2} or (10)^{n/2} for n even, and for Y = 1(01)^{(n-1)/2} for
n odd. The maximum multiplicity of any arbitrary D is ⌈n/2⌉. Thus, an
upper bound for the maximum number of interconnection functions that
must be applied to move data from P to Q is ⌈n/2⌉.

From this, then, the emulator network requires ⌈n/2⌉ passes through
the network to execute a broadcast function. At the first pass, source
PE(P) uses all 2n - 1 PM2I functions to route data. At each successive
pass, all PEs that have data use all the PM2I interconnection functions
to send the data. Since many copies of the same datum are passing
through the network, collisions of data are not considered. In ⌈n/2⌉
passes, the data has reached PE(Q), 0 ≤ Q < N. □
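The broadcast bound can be checked experimentally for small n with the sketch below. It models only the set of PEs that hold a copy of the datum after each pass, ignoring collisions as the proof does; the function name is hypothetical.

```python
import math


def broadcast_passes(n, source=0):
    """Number of passes until every PE holds the datum, when every holder
    sends along all PM2_{+i} and PM2_{-i}, 0 <= i < n, in each pass."""
    N = 1 << n
    have = {source}
    passes = 0
    while len(have) < N:
        have |= {(p + s * (1 << i)) % N
                 for p in have for i in range(n) for s in (1, -1)}
        passes += 1
    return passes
```

Comparing `broadcast_passes(n)` with `math.ceil(n / 2)` for small n gives a quick sanity check of the theorem.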


VIII.6. Full Access Networks

Feierbach and Stevenson [FS77] introduced a Programmable Switching
Network (PSN) which is topologically equivalent to a rearrangeable
switching network [BE65] of size N = 2^n. Figure VIII.8 shows a PSN for
N = 8 with the attendant controls. The 2n - 1 stage network is built
from four function interchange boxes. The first n stages are connected
by Cube_0, Cube_1, ..., Cube_{n-1}. The next n-1 stages are the mirror
image of the first n-1, and so are connected by Cube_{n-2}, Cube_{n-3},
..., Cube_0.

Theorem VIII.15: The emulator network can simulate a 2n-1 stage PSN in
2n-1 passes.

Proof: Let c be the control for the interchange box associated with
PE(i). An algorithm which simulates the 2n-1 Cube functions of a PSN is


    activate all PEs;
    RCR := 0;
    for i = 0 step +1 until n-1 do
        if p_i = 0 and c = ("exchange" or "upper broadcast")
            then RCR := PM2_{+i};
        if p_i = 1 and c = ("exchange" or "lower broadcast")
            then RCR := PM2_{-i};
        TRANSFER;
        DTRin := DTRout;
    for i = n-2 step -1 until 0 do
        if p_i = 0 and c = ("exchange" or "upper broadcast")
            then RCR := PM2_{+i};
        if p_i = 1 and c = ("exchange" or "lower broadcast")
            then RCR := PM2_{-i};
        TRANSFER;
        DTRin := DTRout.


VIII.7. Partitioning

In Chapter V, two types of partitioning were defined. It was shown
how the recirculating PM2I network can be partitioned under single or
multiple control partitioning. The PM2I functions of the emulator
network partition in the same manner. For any recirculating or
multistage network simulated above that can be partitioned, the emulator
network can simulate that partitioning.

As an example, consider partitioning an indirect binary n-cube
network for N = 8 into two groups of four PEs. Let group A be the PEs
whose high-order address bits are 0, i.e., PEs 0, 1, 2, and 3. Let
group B be the PEs whose high-order address bits are 1, i.e., PEs 4, 5,
6, and 7. Figure V.6 illustrates the needed controls for this
partitioning. To simulate this partitioning, the emulator network must
simulate the network shown in Figure VIII.9. An algorithm which
simulates this partitioning is

    activate all PEs;
    RCR := 0;
    for i = 0 step +1 until n-2
        calculate interchange box control;
        if (CONTROL = "Exchange") then Cube_i;
        DTRin := DTRout.
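The effect of omitting stage n-1 can be seen in a small Python model. The per-PE control is supplied as a function `exchange(pe, i)`, a hypothetical interface that must return the same value for both PEs of an interchange box; because only Cube_0 through Cube_{n-2} are used, the high-order address bit of a datum's location never changes.

```python
def simulate_partitioned(n, exchange):
    """Run stages 0 .. n-2 only; where[P] is the final PE of the datum from PE(P)."""
    N = 1 << n
    where = list(range(N))
    for i in range(n - 1):                       # Cube_{n-1} is never executed
        where = [w ^ (1 << i) if exchange(w, i) else w for w in where]
    return where
```

For N = 8, letting group A exchange only at stage 0 and group B only at stage 1 gives two independent permutations, one within PEs 0 to 3 and one within PEs 4 to 7.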


VIII.8. The Emulator Network and MIMD Machines

Two recently proposed MIMD (multiple instruction stream - multiple
data stream) systems are CHoPP and X-tree. In this section, the use of
the emulator network to simulate the connection schemes of these systems
is discussed.

The CHoPP (Columbia Homogeneous Parallel Processor) [SUB77] is a
proposed MIMD system. Figure VIII.10 shows the physical configuration
of this system. Each applications processor has an associated local
memory. All interprocessor communication is handled by the
communications processors. The interconnection network is a
recirculating (single stage) Cube network which connects a processor to
all other processors whose addresses differ from it in only one bit
position. A communications processor views the network as n = log_2 N
lines for data entering the node and n lines for data leaving the node.
Data are passed with a destination address, and multiple passes through
the network are made until the data reach their destinations. At each
pass, the communications processor examining the destination address of
the data determines the next routing of the data. Since more than one
data item may need to be routed on the same output line at any one time,
the communications processor provides queues for storing these data.


Theorem VIII.16: The emulator network can simulate the method of routing
data  used  in  CHoPP. 

Proof: Partition the emulator system into two groups of PEs, the
applications PEs and the communications PEs. Let the applications PEs
be those whose least significant address bit is 0, and let the
communications PEs be those whose least significant address bit is 1.
Data are routed between applications PEs and communications PEs by
Cube_0. The communications PEs are connected using the recirculating
Cube network functions, which the emulator network can perform
(Theorem VIII.9).

Figure VIII.10: The CHoPP is a proposed MIMD computer system. The
interconnection network routes data between processing elements [SUB77].

The communications PEs have the ability to simulate queues for
storing data and to examine the destination tags to determine the next
routing. Since only one DTRin and one DTRout register are available to
each PE, the emulator network cannot route more than one datum from a
communications PE or accept more than one datum in one CHoPP time
period. Multiple arrivals and departures, then, are simulated as sets
of single arrivals and departures. That is, a set of, say, n
simultaneous departures from a node is simulated by the communications
PE of the emulator network as n sequential departures from the node. □
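
The routing described in this proof can be sketched in Python. This is a
minimal sketch under assumptions, not the CHoPP design: the function
names are hypothetical, applications PEs are taken to be the even
addresses and communications PEs the odd addresses, and the rule of
crossing the lowest-order differing bit first is an assumption, since
the text does not fix an order.

```python
from collections import deque


def next_hop(current: int, dest: int, n: int):
    """Cross one differing address bit above bit 0 (lowest first).

    Returns None once current equals dest.
    """
    diff = current ^ dest
    for i in range(1, n):
        if (diff >> i) & 1:
            return current ^ (1 << i)  # Cube_i
    return None


def route(src_app: int, dst_app: int, n: int) -> list:
    """PE addresses visited between two applications PEs (even addresses)."""
    path = [src_app]
    current = src_app ^ 1  # Cube_0: hand off to the communications PE
    path.append(current)
    dest_comm = dst_app ^ 1
    while True:
        hop = next_hop(current, dest_comm, n)
        if hop is None:
            break
        current = hop
        path.append(current)
    path.append(dst_app)  # Cube_0: deliver to the applications PE
    return path


def sequential_departures(outgoing) -> list:
    """Simulate simultaneous departures as one departure per time step."""
    queue = deque(outgoing)
    schedule = []
    step = 0
    while queue:
        schedule.append((step, queue.popleft()))
        step += 1
    return schedule
```

For example, with n = 4, `route(0, 12, 4)` visits PEs 0, 1, 5, 13, and 12:
one Cube_0 hand-off, two passes among the communications PEs, and one
Cube_0 delivery.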

The X-tree [DP78] is a proposed multiprocessor system in which the
processors are interconnected in a tightly coupled, hierarchical, binary
tree structured network. Redundant links are added to the binary tree
of PEs to provide potential fault tolerant communications. Two types of
added links are discussed in [SDP78]. Figure VIII.11 shows a half-ring
structure where the links added to the binary tree at level i are
between PE(P) and PE(P'), where for P even, P' is the PE to the "left" of
P on level i. Figure VIII.12 shows a full-ring topology, where the
added links connect all PEs to the PE on the "right" as well as the PE
on the "left". In the full-ring topology, for a 5-level tree with
2^5 - 1 = 31 PEs, the PE addresses and links for the emulator network
simulation are shown in Figure VIII.13. The PE at level 0, the root
node of the tree, has address P = 10000. It connects to its left child,
P - 2^3, and its right child, P + 2^3. Any PE in level 1 has address
P = X1000. A PE in level 1 connects to its parent node, its two
children, and to the other PE at its level. The parent node is P + 2^3
if the PE is the left child (p_4 = 0), and the parent node is P - 2^3 if
the PE is the right child (p_4 = 1). The PE addresses at level 1 differ
by 2^4. Note that all connections are PM2I connections for the PEs
selected for the tree.

Theorem VIII.17: The emulator network can simulate the interprocessor
connections  of  the  half-ring  and  full-ring  topology  of  the  X-tree 
multiprocessor  system. 

Proof: Since the half-ring topology is a subset of the full-ring
topology, it is sufficient to show that the emulator network can
simulate the interprocessor connections of the full-ring topology.
For N = 2^n PEs of the emulator network, an n level X-tree can be
constructed that has N-1 PEs. In general, at level i, 0 ≤ i < n, any PE
at that level has address P = X^i 1 0^{n-i-1}. It connects to five other
PEs. Its parent PE is P + 2^{n-i-1} if p_{n-i} = 0, and P - 2^{n-i-1} if
p_{n-i} = 1. Its left child is PE(P - 2^{n-i-2}). Its right child is
PE(P + 2^{n-i-2}). It connects to the two PEs adjacent to it at level i,
P + 2^{n-i} and P - 2^{n-i}. All the interconnections are PM2I
interconnection functions, and so the emulator network can simulate
their effects in one pass through the network for each interconnection
function. □
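
The address scheme used in this proof can be checked with a short Python
sketch. It is an illustration under assumptions: the function names are
hypothetical, and a PE at level i is assumed to have an address of the
form X^i 1 0^{n-i-1}, so that its lowest set bit is at position n-i-1,
which matches the 5-level example (root 10000, level 1 addresses X1000).
The sketch lists each PE's links and tests that every link is a PM2I
connection, i.e., a move of plus or minus a power of two modulo N.

```python
def xtree_links(p: int, n: int) -> dict:
    """Links of tree PE p (1 <= p < 2^n) in an n-level full-ring X-tree."""
    if not 0 < p < (1 << n):
        raise ValueError("PE 0 is not part of the tree")
    size = 1 << n
    k = (p & -p).bit_length() - 1  # position of the lowest set bit
    level = n - 1 - k
    links = {}
    if level > 0:
        is_left_child = ((p >> (k + 1)) & 1) == 0
        step = (1 << k) if is_left_child else -(1 << k)
        links["parent"] = (p + step) % size
        links["right"] = (p + (1 << (k + 1))) % size
        links["left"] = (p - (1 << (k + 1))) % size
    if level < n - 1:
        links["left_child"] = p - (1 << (k - 1))
        links["right_child"] = p + (1 << (k - 1))
    return links


def is_pm2i(p: int, q: int, n: int) -> bool:
    """True if q = p + 2^j or q = p - 2^j (mod 2^n) for some 0 <= j < n."""
    size = 1 << n
    d = (q - p) % size
    return any(d == (1 << j) or d == size - (1 << j) for j in range(n))


if __name__ == "__main__":
    n = 5
    for p in (16, 8, 24, 1):
        print(format(p, "05b"), xtree_links(p, n))
```

At level 1 the "left" and "right" links name the same PE, since the two
PEs at that level differ by 2^{n-1}; the root has no ring links.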



VIII.9.  Conclusions

The emulator network, with N inputs and outputs, can be built from
N(2 log2 N + 1) tri-state buffers for each bit of network width. It
can simulate the following n = log2 N stage networks in n passes:
Pease's indirect binary n-cube, Goke and Lipovski's n-stage SW banyan
and CC banyan, Batcher's STARAN flip network with flip or shift control,
Siegel and Smith's generalized cube, n-stage Shuffle-Exchange, Lang and
Stone's multistage Shuffle-Exchange with "simplified control," Lawrie's
omega with destination tag control, Lawrie's omega with broadcast
capabilities, Feng's data manipulator, and Siegel and Smith's augmented
data manipulator. In one pass, the emulator can simulate an
interconnection function of the following recirculating (single stage)
networks: Shuffle-Exchange, Cube, Plus-Minus 2^i (PM2I), and Illiac. It
can simulate the n-stage Shuffle - No Shuffle - Exchange (SNSE) network
in at most 2n passes and Feierbach and Stevenson's 2n-1 stage
Programmable Switching Network (PSN) in at most 2n-1 passes. It can
simulate a shuffle or an inverse shuffle of size 2^{n-k} in n-k-1 passes.
The emulator network can also simulate the partitioning of any of the
above networks that have been shown to be partitionable.
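
As a worked instance of the buffer count N(2 log2 N + 1) quoted above,
the following Python sketch evaluates the formula for a given network
size and width. The function name and the width parameter are
illustrative choices, and N is assumed to be a power of two.

```python
def tristate_buffers(num_pes: int, width_bits: int = 1) -> int:
    """Buffers needed: N(2 log2 N + 1) per bit of network width."""
    n = num_pes.bit_length() - 1  # log2 N when N is a power of two
    if num_pes < 2 or (1 << n) != num_pes:
        raise ValueError("num_pes must be a power of two, at least 2")
    return width_bits * num_pes * (2 * n + 1)


if __name__ == "__main__":
    for size in (8, 64, 1024):
        print(size, tristate_buffers(size))
```

For N = 8 this gives 8 x 7 = 56 buffers per bit of width, and for
N = 1024 it gives 1024 x 21 = 21504.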

The flexibility of the emulator network is seen in its ability to
simulate many different control schemes for many types of networks.
This flexibility comes from the types of interconnection functions
included and from the independent function control that allows each PE
to execute any of the 2n interconnection functions of the network
independently of any other PEs. Given the logic functions at each node
of a test network to be emulated, the PEs at each node of the emulator
network can calculate these logic functions and control the emulator
network so that it responds in the same manner as the test network. The
emulator network can also be used to test for the effects of faulty
nodes in the network. A given node of the emulator network can be
assigned to fail in a predetermined manner. The effects of the failure
on the remainder of the network can then be observed. Because of this
flexibility, the emulator network can be a powerful tool for evaluating
proposed SIMD machine interconnection networks and their control
schemes.


IX.  CONCLUSIONS 


This  work  is  a  study  of  interconnection  networks  for  SIMD  machines, 
with  special  emphasis  placed  on  reconfigurable,  partitionable,  multiple 
SIMD  systems.  The  results  have  both  immediate  practical  applications 
and  theoretical  implications  about  the  nature  of  multiprocessor 
interconnection  systems. 

The  partitioning  of  interconnection  networks  (Chapter  V)  is 
essential  knowledge  for  designing  partitionable  multiple  SIMD  computer 
systems.  In  exploring  this  topic,  information  was  gained  about  the 
types  of  data  transfers  that  networks  can  perform  and  the  limits  imposed 
by  the  interconnection  functions  of  the  networks  and  the  type  of  control 
scheme used. For example, for N = 2^n, the indirect binary n-cube
network  with  individual  interchange  box  control  can  be  partitioned  into 
2^{n-r} smaller, complete, independent indirect binary n-cube networks when
all PE addresses within a partition have n-r address bits in common.
The STARAN flip network is topologically equivalent to the indirect
binary n-cube [SIS78]. The flip control allows the flip network to be
partitioned only under single control partitioning, where all partitions
must  perform  the  same  data  transfer.  With  the  more  flexible  shift 
control,  the  flip  network  cannot  be  partitioned  under  multiple  control 
partitioning, where all partitions may perform different data transfers.
Under single control partitioning, where all partitions perform the same
data  transfer,  the  shift  control  can  only  partition  the  flip  network 
into 2^{n-r} groups if n-r of the most significant or the least significant
address  bits  are  the  same  for  all  PEs  of  the  partition.  Consider  also 
the  generalized  cube  network  with  individual  interchange  box  control. 
The  generalized  cube  can  be  partitioned  under  the  restriction  that  n-r 
PE  address  bits  are  the  same  for  all  PEs  in  the  partition.  The 
Augmented  Data  Manipulator,  which  is  a  more  powerful  interconnection 
network  than  the  generalized  cube  (Chapter  VI)  can  be  partitioned  only 
when  some  set  of  the  least  significant  address  bits  are  common  to  all 
PEs  in  a  partition.  Future  work  in  this  area  should  consider  the  entire 
computer  system  and  the  influence  the  interconnection  network  exerts 
upon  it.  A  partitionable  system  has  more  overhead  in  operating  system 
software  than  a  conventional  SIMD  machine.  Also,  extra  hardware  costs 
will  be  incurred  to  control  the  partitioned  network. 

The  emulator  network  is  presented  as  a  practical  hardware  design 
aid.  The  theorems  of  Chapter  VIII  are  also  theoretical  results  about 
the nature of the interconnection networks that were discussed. The
PM2I functions, together with independent function control and the
computation facilities of the PEs, can be used to simulate many
interconnection networks from the literature, as well as useful
interconnections such as the perfect shuffle, the inverse perfect
shuffle,  and  broadcasts.  When  designing  a  prototype  for  a  partitionable 
SIMD  parallel  processing  system,  incorporation  of  the  emulator  network 
into  the  prototype  would  allow  different  network  connections  and  control 
schemes to be tested and evaluated. The algorithms that use the
emulator network to simulate other networks reveal information about the
characteristics  of  the  networks  being  simulated.  Thus,  this  chapter 
provides  both  a  practical  and  theoretical  analysis. 

The  study  of  the  Augmented  Data  Manipulator  in  Chapter  VI 
enumerates  some  of  the  differences  between  the  ADM  and  multistage  cube 
networks.  Theorems  are  presented  which  show  how  and  why  the  ADM  can 
perform  a  perfect  shuffle  function  in  one  pass  through  the  network,  but 
not  an  inverse  perfect  shuffle  or  a  bit  reversal  in  a  single  pass.  Some 
fault  tolerant  properties  of  the  network  are  presented.  The  Cube 
functions  and  the  uniform  shift  functions  are  used  to  show  how  a  given 
interconnection  may  have  several  different,  equivalent  paths  through  the 
ADM  network.  The  relationship  between  the  ADM  and  Inverse  ADM  is  also 
analysed. 

Chapter  IV  developed  techniques  for  the  analysis  of  combinational 
logic  multistage  networks,  pipelined  multistage  networks,  and 
recirculating  networks.  Detailed  comparisons  were  made  between 
combinational  logic  multistage  and  pipelined  multistage  networks  for 
networks  of  equal  cost  and  networks  that  require  equal  time  to  complete 
a  given  data  transfer.  Examples  were  presented  which  illustrate  the 
behavior  of  these  networks  for  a  given  cost  and  for  a  given  number  of 
data  items  to  be  transferred.  These  analyses  will  aid  a  system  designer 
in  choosing  a  network  implementation  which  will  be  cost-effective  for 
the  intended  application  of  the  system. 

These  results  were  applied  in  Chapter  VII  to  three  parallel 
algorithms  for  image  processing.  For  each  algorithm,  the  impact  on  the 
computation time of different choices of interconnection networks was
shown. When uniform shifts of the data from one PE to another are
present in the algorithm, the recirculating cube is a poor choice of
interconnection  network.  A  binary  addition  algorithm  must  be  used  to 
calculate  each  data  transfer.  Multistage  cube  and  PM2I  networks, 
however,  can  perform  such  uniform  shifts  in  one  use  of  the  network.  The 
Illiac  network  is  useful  for  passing  data  to  PEs  in  the  immediate 
neighborhood of PE P, i.e., PEs (P+1) modulo N, (P-1) modulo N, (P+√N)
modulo N, and (P-√N) modulo N. When data transfers are among PEs
outside  a  limited  neighborhood,  as  in  the  histogram  formation  algorithm 
presented,  the  Illiac  network  performs  poorly.  When  transfers  of  blocks 
of  data  are  common  in  an  algorithm,  a  pipelined  multistage  network  shows 
its  worth.  Future  work  in  this  area  should  examine  more  algorithms  for 
the  types  of  data  transfers  present.  From  this,  models  of  data 
transfers  can  be  developed,  and  techniques  for  evaluation  of  the  impact 
of  the  interconnection  network  on  the  algorithm  can  be  applied. 
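
The Illiac neighborhood named above can be written out as a short Python
sketch. It is an illustration only: the function name is hypothetical,
PEs are modeled as the integers 0 to N-1, and N is assumed to be a
perfect square.

```python
import math


def illiac_neighbors(p: int, num_pes: int) -> list:
    """PEs reachable from PE p in one use of the Illiac network."""
    root = math.isqrt(num_pes)
    if root * root != num_pes:
        raise ValueError("num_pes must be a perfect square")
    return [
        (p + 1) % num_pes,
        (p - 1) % num_pes,
        (p + root) % num_pes,
        (p - root) % num_pes,
    ]


if __name__ == "__main__":
    print(illiac_neighbors(0, 16))  # [1, 15, 4, 12]
```

Only four PEs are one transfer away from any given PE, which is why
algorithms whose transfers reach outside this neighborhood need many
passes on the Illiac network.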

Tools  were  developed  which  can  be  used  by  the  architect  of  parallel 
computer  systems  to  design  intelligently  an  interconnection  network. 
Underlying  theoretical  aspects  of  interconnection  networks  presented 
here  further  expand  the  knowledge  of  the  types  of  interprocessor 
connections  that  networks  can  accomplish.  The  analysis  of  the  structure 
of  networks  has  resulted  in  criteria  which  can  be  applied  to  determine 
the  usefulness  of  a  given  network  for  a  specific  task. 



LIST  OF  REFERENCES 


[BAR68] Barnes, G. H., Brown, R. M., Kato, M., Kuck, D. J., Slotnick, D.
L., and Stokes, R. A., "The Illiac IV computer," IEEE Transactions on
Computers, vol. C-17, no. 8, August 1968, pages 746-757.

[BA68] Batcher, K. E., "Sorting networks and their applications,"
Proceedings of the Spring Joint Computer Conference 1968, AFIPS
Conference Proceedings, volume 32 (1968), pages 307-314.

[BA75] Batcher, K. E., "The multidimensional access memory in STARAN,"
IEEE Transactions on Computers, vol. C-26, no. 2, February 1977, pages
174-177.

[BA76] Batcher, K. E., "The flip network in STARAN," Proceedings of the
1976 International Conference on Parallel Processing, August 1976, pages
65-71.

[BAU74] Bauer, L. H., "Implementation of data manipulating functions on
the STARAN associative array processor," Lecture Notes in Computer
Science 24, Proceedings of the 1974 Sagamore Computer Conference on
Parallel Processing, T. Feng, editor, Springer-Verlag, Berlin, August
1974, pages 209-227.

[BE65] Benes, V. E., Mathematical Theory of Connecting Networks and
Telephone Traffic, New York: Academic, 1965.

[BFHP79] Briggs, F., Fu, K. S., Hwang, K., and Patel, J., "PM4: A
reconfigurable multiprocessor system for pattern recognition and image
processing," 1979 National Computer Conference, AFIPS Conference
Proceedings, volume 48, June 1979, pages 255-265.

[BOU72] Bouknight, W. J., Denenberg, S. A., McIntyre, D. E., Randall, J.
M., Sameh, A. H., and Slotnick, D. L., "The Illiac IV System,"
Proceedings of the IEEE, vol. 60, no. 4, April 1972, pages 369-388.

[BS78] Bogdanowicz, J. F., and Siegel, H. J., "A partitionable multi-
microprogrammable-microprocessor system for image processing,"
Proceedings of the IEEE Computer Society Workshop on Pattern Recognition
and Artificial Intelligence, April 1978, pages 141-144.

[CDL77] Cordella, L., Duff, M. J. B., and Levialdi, S., "Thresholding: a
challenge for parallel processing," Computer Graphics and Image
Processing, vol. 6, 1977, pages 207-220.



[CDL78] Cordella, L. P., Duff, M. J. B., and Levialdi, S., "An analysis
of computational cost in image processing: a case study," IEEE
Transactions on Computers, vol. C-27, no. 10, October 1978, pages
904-910.

[CGY74] Couranz, G. R., M. S. Gerhardt, and C. J. Young, "Programmable
RADAR signal processing using the RAP," Lecture Notes in Computer
Science 24, Proceedings of the 1974 Sagamore Computer Conference on
Parallel Processing, T. Feng, editor, Springer-Verlag, Berlin, August
1974, pages 37-52.

[DP78] Despain, A. M. and Patterson, D. A., "X-tree: a tree structured
multi-processor computer architecture," Fifth Annual Symposium on
Computer Architecture (ACM SIGARCH Newsletter, vol. 6, no. 7, April
1978), April 1978, pages 144-151.

[FE74] Feng, T., "Data manipulating functions in parallel processors
and their implementations," IEEE Transactions on Computers, vol. C-23,
no. 3, March 1974, pages 309-318.

[FL66] Flynn, M. J., "Very high-speed computing systems," Proceedings
of the IEEE, vol. 54, no. 12, December 1966, pages 1901-1909.

[FS77] Feierbach, G., and Stevenson, D., "A feasibility study of
programmable switching networks for data routing," Phoenix Project,
memorandum number 003, Institute for Advanced Computation, Sunnyvale,
California.

[FU78] Fu, K. S., "Special computer architectures for pattern
recognition and image processing - an overview," 1978 National Computer
Conference, AFIPS Conference Proceedings, volume 47 (1978), pages
1003-1013.

[GEN78] Gentleman, W. M., "Some complexity results for matrix
computations on parallel processors," Journal of the ACM, vol. 25, no.
1, January 1978, pages 112-115.

[GL73] Goke, L. R., and Lipovski, G. J., "Banyan networks for
partitioning multiprocessor systems," First Annual Symposium on Computer
Architecture (ACM SIGARCH Newsletter, vol. 2, no. 4, December 1973),
December 1973, pages 21-28.

[GOK76] Goke, L. Rodney, "Banyan networks for partitioning
multiprocessor systems," PhD thesis, Department of Electrical
Engineering, University of Florida, June 1976.

[GOL61] Golomb, S. W., "Permutations by cutting and shuffling," SIAM
Review, vol. 3, October 1961, pages 293-297.



[GR76] Graham, Marvin Lowell, "An array computer for the class of
problems typified by the general circulation model of the atmosphere,"
PhD thesis, Department of Computer Science, University of Illinois at
Urbana-Champaign, January 1976.

[HEL77] Heller, D., "Minimal parallelism for computations under time
constraints," Symposium on High Speed Computer and Algorithm
Organization, D. J. Kuck, D. H. Lawrie, A. H. Sameh, editors, Academic
Press, New York, April 1977, pages 321-322.

[HIG72] Higbie, L. C., "The Omen computer associative array processor,"
Compcon 72, IEEE Computer Society Conference Proceedings, September
1972, pages 287-290.

[KST73] Kogge, P. M., and Stone, H. S., "A parallel algorithm for the
efficient solution of a general class of recurrence equations," IEEE
Transactions on Computers, vol. C-22, no. 8, August 1973, pages 786-793.

[KUC77] Kuck, D. J., "A survey of parallel machine organization and
programming," ACM Computing Surveys, vol. 9, March 1977, pages 29-59.

[LAN76] Lang, T., "Interconnections between processors and memory
modules using the shuffle-exchange network," IEEE Transactions on
Computers, vol. C-25, no. 5, May 1976, pages 496-503.

[LAST76] Lang, T., and Stone, H. S., "A shuffle-exchange network with
simplified control," IEEE Transactions on Computers, vol. C-25, no. 1,
January 1976, pages 55-65.

[LAW73] Lawrie, Duncan H., "Memory - processor connection networks," PhD
thesis, Department of Computer Science, University of Illinois at
Urbana-Champaign, February 1973.

[LAW75] Lawrie, D. H., "Access and alignment of data in an array
processor," IEEE Transactions on Computers, vol. C-24, no. 12, December
1975, pages 1145-1155.

[LT77] Lipovski, G. J., and Tripathi, A., "A reconfigurable
varistructure array processor," 1977 International Conference on
Parallel Processing, August 1977, pages 165-174.

[NU77] Nutt, G. J., "Microprocessor implementation of a parallel
processor," Fourth Annual Symposium on Computer Architecture (ACM
SIGARCH Newsletter, vol. 5, no. 7, March 1977), March 1977, pages
147-152.

[OPT71] Opferman, D. C., and N. T. Tsao-Wu, "On a class of
rearrangeable switching networks," Bell System Technical Journal, vol.
50, May-June 1971, pages 1579-1618.



[PAT79] Patel, J., "Processor-memory interconnections for
multiprocessors," Sixth Annual Symposium on Computer Architecture (ACM
SIGARCH Newsletter, vol. 7, no. 6, April 1979), April 1979, pages
168-177.

[PEA77] Pease, M. C., "The indirect binary n-cube microprocessor
array," IEEE Transactions on Computers, vol. C-26, no. 5, May 1977,
pages 458-473.

[REI60] Reitwiesner, G. W., "Binary Arithmetic," in Advances in
Computers, volume 1, Academic Press, New York, New York, 1960.

[ROK76] Rosenfeld, A., and Kak, A. C., Digital Picture Processing,
Academic Press, New York, 1976.

[SAM77] Sameh, A. H., "Numerical parallel algorithms - A survey,"
Symposium on High Speed Computer and Algorithm Organization, D. J. Kuck,
D. H. Lawrie, A. H. Sameh, editors, Academic Press, New York, April
1977, pages 207-226.

[SDP78] Sequin, C. H., Despain, A. M., and Patterson, D. A.,
"Communication in X-tree, a modular multiprocessor system," ACM 78
Proceedings, December 1978, pages 194-203.

[SIE75] Siegel, H. J., "Analysis techniques for SIMD machine
interconnection networks and the effects of processor address masks,"
1975 Sagamore Computer Conference on Parallel Processing, August 1975,
pages 106-109.

[SIE77a] Siegel, H. J., "Analysis techniques for SIMD machine
interconnection networks and the effects of processor address masks,"
IEEE Transactions on Computers, vol. C-26, no. 2, February 1977, pages
153-161.

[SIE77b] Siegel, H. J., "Interconnection networks and masking schemes
for single instruction stream - multiple data stream machines," PhD
thesis, Department of Electrical Engineering and Computer Science,
Princeton University, May 1977.

[SIE78a] Siegel, H. J., "Preliminary design of a versatile parallel
image processing system," Proceedings of the Third Biennial Conference
on Computing in Indiana, April 1978, pages 11-25.

[SIE78b] Siegel, H. J., "Partitionable SIMD computer system
interconnection network universality," Proceedings of the Sixteenth
Annual Allerton Conference on Communication, Control, and Computing,
October 1978, pages 586-595.

[SIE79a] Siegel, H. J., "Interconnection networks for SIMD machines,"
IEEE Computer Magazine, vol. 12, no. 6, June 1979, pages 57-65.



[SIE79b] Siegel, H. J., "A model of SIMD machines and a comparison of
various interconnection networks," to appear in IEEE Transactions on
Computers.

[SIS78] Siegel, H. J., and Smith, S. D., "Study of multistage SIMD
interconnection networks," Fifth Annual Symposium on Computer
Architecture (ACM SIGARCH Newsletter, vol. 6, no. 7, April 1978), April
8-10, 1978, pages 223-229.

[SM78] Siegel, H. J., and Mueller, Jr., P. T., "The organization and
language design of microprocessors for an SIMD/MIMD system," Second
Rocky Mountain Symposium on Microcomputers: Systems, Software,
Architecture, August 1978, Pingree Park, Colorado, pages 311-340.

[SMSH78] Siegel, H. J., Mueller, Jr., P. T., and Smalley, Jr., H. E.,
"Control of a partitionable multimicroprocessor system," 1978
International Conference on Parallel Processing, August 1978, pages
9-17.

[SS79] Siegel, H. J., Siegel, L. J., McMillen, R. J., Mueller, Jr., P.
T., and Smith, S. D., "An SIMD/MIMD multimicroprocessor system for image
processing and pattern recognition," 1979 Conference on Pattern
Recognition and Image Processing, August 1979.

[ST71] Stone, H. S., "Parallel processing and the perfect shuffle,"
IEEE Transactions on Computers, vol. C-20, no. 2, February 1971, pages
153-161.

[ST75] Stone, H. S., "Parallel Computers," in Introduction to Computer
Architecture, Science Research Associates, Inc., Chicago, Illinois,
1975, pages 318-374.

[SUB77] Sullivan, H., and Bashkow, T. R., "A large scale, homogeneous,
fully distributed parallel machine, I," Fourth Annual Symposium on
Computer Architecture (ACM SIGARCH Newsletter, vol. 5, no. 7, March
1977), March 1977, pages 105-117.

[SW78] Swain, P. H., "Fundamentals of Pattern Recognition in Remote
Sensing," Chapter 3 in P. H. Swain and S. M. Davis, editors, Remote
Sensing: The Quantitative Approach, McGraw-Hill, 1978.

[TI76] The TTL Data Book for Design Engineers, Texas Instruments,
Dallas, Texas, 1976.

[WEN76] Wen, K. Y., "Interprocessor connections - capabilities,
exploitation, and effectiveness," Department of Computer Science,
University of Illinois, Urbana, Illinois, Report no. UIUCDCS-R-76-830,
October 1976.


Errata 


- - — _______ 

f>9ur.  (V.I5  c.ptlo„  t.lo„5,  to  with  IV.U  cptlon. 

iv.a.  ooptloo 

"’“™''"'“^t.Ohh.,„„,.to,„„t.w,th„.„t.„,^. 

. . .  ”  **=  '■"♦zf)  -  Q  .00  PH  .  (Q,  .  p. 


